From: Mel Gorman <mel@csn.ul.ie>
To: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Nishanth Aravamudan <nacc@us.ibm.com>,
	linux-numa@vger.kernel.org, Adam Litke <agl@us.ibm.com>,
	Andy Whitcroft <apw@canonical.com>,
	Eric Whitney <eric.whitney@hp.com>,
	Randy Dunlap <randy.dunlap@oracle.com>
Subject: Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
Date: Tue, 8 Sep 2009 22:41:10 +0100	[thread overview]
Message-ID: <20090908214109.GB6481@csn.ul.ie> (raw)
In-Reply-To: <alpine.DEB.1.00.0909081307100.13678@chino.kir.corp.google.com>

On Tue, Sep 08, 2009 at 01:18:01PM -0700, David Rientjes wrote:
> On Tue, 8 Sep 2009, Mel Gorman wrote:
> 
> > > Au contraire, the hugepages= kernel parameter is not restricted to any 
> > > mempolicy.
> > > 
> > 
> > I'm not seeing how it would be considered symmetric to compare allocation
> > at a boot-time parameter with freeing happening at run-time within a mempolicy.
> > It's more plausible to me that such a scenario will having the freeing
> > thread either with no policy or the ability to run with no policy
> > applied.
> > 
> 
> Imagine a cluster of machines that are all treated equally to serve a 
> variety of different production jobs.  One of those production jobs 
> requires a very high percentage of hugepages.  In fact, its performance 
> gain is directly proportional to the number of hugepages allocated.
> 
> It is quite plausible for all machines to be booted with hugepages= to 
> achieve the maximum number of hugepages that those machines may support.  
> Depending on what jobs they will serve, however, those hugepages may 
> immediately be freed (or a subset, depending on other smaller jobs that 
> may want them.)  If the job scheduler is bound to a mempolicy which does 
> not include all nodes with memory, those hugepages are now leaked. 

Why is a job scheduler that expects to affect memory on a global basis
running inside a mempolicy that restricts it to a subset of nodes? It seems
inconsistent that one isolated job starting up could alter global state and
potentially affect other jobs that start later.

In addition, if the job's performance is directly proportional to the number
of hugepages it gets access to, why is it starting up with access to only a
subset of the available hugepages? Why is it not set up to be the first job
to start on a freshly booted machine, starting on the subset of nodes it is
allowed and requesting the maximum number of hugepages it needs so that it
achieves maximum performance? With the memory policy approach, it's very
straightforward to do this because all the job has to do is write to
nr_hugepages when it starts up.
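
Roughly something like the following would do it (a sketch only; the node
list and page count are made up for illustration, and
/proc/sys/vm/nr_hugepages is the standard global interface):

# Started under the job's mempolicy, the pool is sized on the allowed
# nodes only once nr_hugepages obeys memory policies.
numactl -m 0,1 sh -c 'echo 1024 > /proc/sys/vm/nr_hugepages'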

> That 
> was not the behavior over the past three or four years until this 
> patchset.
> 

While this is true, I know people have also been bitten by the expectation
that writing to nr_hugepages would obey a memory policy, were surprised when
it didn't, and sent me whinging emails. It also appeared obvious to me that
this is how the interface should behave even if it wasn't doing so in
practice. Once nr_hugepages obeys memory policies, it's fairly convenient to
size the huge page pool on a subset of nodes using numactl - a tool people
would generally expect to use when operating on nodes. Hence the example
usage being

numactl -m x,y,z hugeadm --pool-pages-min $PAGESIZE:$NUMPAGES

> That example is not dealing in hypotheticals or assumptions on how people 
> use hugepages, it's based on reality.  As I said previously, I don't 
> necessarily have an objection to that if it can be shown that the 
> advantages significantly outweigh the disadvantages.  I'm not sure I see 
> the advantage in being implict vs. explicit, however. 

The advantage is that with memory policies applied to nr_hugepages, it's very
convenient to allocate pages within a subset of nodes without worrying about
exactly which nodes those huge pages are allocated from. The kernel allocates
them on a round-robin basis, placing more pages on one node than another if
fragmentation requires it, rather than shifting the burden onto a userspace
application or the system administrator to figure out which nodes an
allocation might succeed on. It's likely that writing to the global
nr_hugepages within a mempolicy will end up with a more sensible result than
a userspace application juggling the individual node-specific nr_hugepages
files.
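
For example (a sketch; the node IDs and page count are illustrative, and the
per-node totals are read from the standard node meminfo files):

# Size the pool to 128 pages; with this patchset the allocation is
# constrained to nodes 0 and 1 and round-robined between them.
numactl -m 0,1 sh -c 'echo 128 > /proc/sys/vm/nr_hugepages'

# See how the kernel spread the pages over the allowed nodes.
grep HugePages_Total /sys/devices/system/node/node*/meminfo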

To do the same with the explicit interface, a userspace application or
administrator would have to keep reading each node's existing nr_hugepages,
writing existing_nr_hugepages+1 to each node in the allowed set, re-reading
to check for allocation failure and round-robining by hand. This seems
awkward-for-the-sake-of-being-awkward when the kernel is already perfectly
capable of round-robining the requested number of pages over the allowed
nodes, allocating more on one node if necessary.
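
Something like the following sketch is what userspace would be signing up
for (the per-node nr_hugepages attribute path is the one proposed in patch
5/6 of this series, 2048kB assumes the default x86-64 huge page size, and
the node list and target count are illustrative):

NODES="0 1"
TARGET=128
got=0
while [ "$got" -lt "$TARGET" ]; do
	progress=0
	for n in $NODES; do
		f=/sys/devices/system/node/node$n/hugepages/hugepages-2048kB/nr_hugepages
		before=$(cat "$f")
		# Ask this node for one more page, then re-read to see if it worked.
		echo $((before + 1)) > "$f"
		after=$(cat "$f")
		if [ "$after" -gt "$before" ]; then
			got=$((got + 1))
			progress=1
		fi
		[ "$got" -ge "$TARGET" ] && break
	done
	# Give up if no node in the allowed set could supply another page.
	[ "$progress" -eq 0 ] && break
done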

> Mempolicy 
> allocation and freeing is now _implicit_ because its restricted to 
> current's mempolicy when it wasn't before, yet node-targeted hugepage 
> allocation and freeing is _explicit_ because it's a new interface and on 
> the same granularity.
> 

Arguably, because the application is restricted by a memory policy, it
should not be able to operate outside that policy and should be forbidden
from writing to the per-node nr_hugepages files outside the allowed set.
However, that would appear awkward for the sake of it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

