Re: [PATCH 6/6] hugetlb: update hugetlb documentation for mempolicy based management.

From: Randy Dunlap <randy.dunlap@oracle.com>
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	Mel Gorman <mel@csn.ul.ie>, Nishanth Aravamudan <nacc@us.ibm.com>,
	David Rientjes <rientjes@google.com>,
	linux-numa@vger.kernel.org, Adam Litke <agl@us.ibm.com>,
	Andy Whitcroft <apw@canonical.com>,
	eric.whitney@hp.com
Subject: Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
Date: Thu, 3 Sep 2009 13:42:10 -0700
Message-ID: <20090903134210.5a27611d.randy.dunlap@oracle.com>
In-Reply-To: <20090828160351.11080.21379.sendpatchset@localhost.localdomain>

On Fri, 28 Aug 2009 12:03:51 -0400 Lee Schermerhorn wrote:

(Thanks for cc:, David.)

> [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
> 
> Against: 2.6.31-rc7-mmotm-090827-0057
> 
> V2:  Add brief description of per node attributes.
> 
> This patch updates the kernel huge tlb documentation to describe the
> numa memory policy based huge page management.  Additionaly, the patch
> includes a fair amount of rework to improve consistency, eliminate
> duplication and set the context for documenting the memory policy
> interaction.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
>  1 file changed, 172 insertions(+), 85 deletions(-)
> 
> Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:16.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:32.000000000 -0400

> @@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
>  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
>  in the kernel.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> -pages in the kernel.  Super user can dynamically request more (or free some
> -pre-configured) huge pages.
> -The allocation (or deallocation) of hugetlb pages is possible only if there are
> -enough physically contiguous free pages in system (freeing of huge pages is
> -possible only if there are enough hugetlb pages free that can be transferred
> -back to regular memory pool).
> -
> -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> -be used for other purposes.
> -
> -Once the kernel with Hugetlb page support is built and running, a user can
> -use either the mmap system call or shared memory system calls to start using
> -the huge pages.  It is required that the system administrator preallocate
> -enough memory for huge page purposes.
> -
> -The administrator can preallocate huge pages on the kernel boot command line by
> -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> -requested.  This is the most reliable method for preallocating huge pages as
> -memory has not yet become fragmented.
> +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> +allocated in the kernel's huge page pool.  These are called "persistent"
> +huge pages.  A user with root privileges can dynamically allocate more or
> +free some persistent huge pages by increasing or decreasing the value of
> +'nr_hugepages'.
> +
> +Pages that are used as huge pages are reserved inside the kernel and cannot
> +be used for other purposes.  Huge pages can not be swapped out under

                                           cannot

> +memory pressure.
> +
> +Once a number of huge pages have been pre-allocated to the kernel huge page
> +pool, a user with appropriate privilege can use either the mmap system call
> +or shared memory system calls to use the huge pages.  See the discussion of
> +Using Huge Pages, below

                     below.

> +
> +The administrator can preallocate persistent huge pages on the kernel boot
> +command line by specifying the "hugepages=N" parameter, where 'N' = the
> +number of requested huge pages requested.  This is the most reliable method

drop first "requested"

> +or preallocating huge pages as memory has not yet become fragmented.

   s/or/of/ (or "for", as in the removed text)

>  
>  Some platforms support multiple huge page sizes.  To preallocate huge pages
>  of a specific size, one must preceed the huge pages boot command parameters
> @@ -80,19 +77,24 @@ with a huge page size selection paramete
>  be specified in bytes with optional scale suffix [kKmMgG].  The default huge
>  page size may be selected with the "default_hugepagesz=<size>" boot parameter.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> -size] hugetlb pages in the kernel.  Super user can dynamically request more
> -(or free some pre-configured) huge pages.
> -
> -Use the following command to dynamically allocate/deallocate default sized
> -huge pages:
> +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> +indicates the current number of pre-allocated huge pages of the default size.
> +Thus, one can use the following command to dynamically allocate/deallocate
> +default sized persistent huge pages:
>  
>  	echo 20 > /proc/sys/vm/nr_hugepages
>  
> -This command will try to configure 20 default sized huge pages in the system.
> +This command will try to adjust the number of default sized huge pages in the
> +huge page pool to 20, allocating or freeing huge pages, as required.
> +
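
It might help to show readers how to check the result, since the pool can
end up smaller than requested when contiguous memory is short.  A minimal
sketch only: the helper name set_hugepages and its two arguments are made up
for illustration, and the control file is passed in rather than hard-coded.

```shell
#!/bin/sh
# Sketch: request a huge page pool size, then read back what was granted.
# The helper name and its arguments are illustrative, not an existing tool.
#   $1 = control file (e.g. /proc/sys/vm/nr_hugepages; writing needs root)
#   $2 = requested number of huge pages
set_hugepages() {
	ctl=$1
	want=$2
	# Ask for the new pool size; give up if the file is not writable.
	echo "$want" > "$ctl" || return 2
	# Read the value back; it may differ from the request.
	got=$(cat "$ctl")
	if [ "$got" -eq "$want" ]; then
		echo "ok: pool is $got huge pages"
		return 0
	fi
	echo "short: asked for $want, pool is $got huge pages"
	return 1
}
```

As root, "set_hugepages /proc/sys/vm/nr_hugepages 20" would then report
whether the pool actually reached 20 pages.
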
>  On a NUMA platform, the kernel will attempt to distribute the huge page pool
> -over the all on-line nodes.  These huge pages, allocated when nr_hugepages
> -is increased, are called "persistent huge pages".
> +over the all the nodes specified by the NUMA memory policy of the task that

drop first "the"

> +modifies nr_hugepages that contain sufficient available contiguous memory.

whoa.  too many "that"s.  confusing.

> +These nodes are called the huge pages "allowed nodes".  The default for the
> +huge pages allowed nodes--when the task has default memory policy--is all
> +on-line nodes.  See the discussion below of the interaction of task memory
> +policy, cpusets and per node attributes with the allocation and freeing of
> +persistent huge pages.
>  
>  The success or failure of huge page allocation depends on the amount of
>  physically contiguous memory that is preset in system at the time of the
> @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
...

> @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
>  /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
>  huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
>  requested by applications.  Writing any non-zero value into this file
> -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> -huge pages from the buddy allocator, when the normal pool is exhausted. As
> -these surplus huge pages go out of use, they are freed back to the buddy
> -allocator.
> +indicates that the hugetlb subsystem is allowed to try to obtain that
> +number of "surplus" huge pages from the kernel's normal page pool, when the
> +persistent huge page pool is exhausted. As these surplus huge pages become
> +unused, they are freed back to the kernel's normal page pool.
>  
> -When increasing the huge page pool size via nr_hugepages, any surplus
> +When increasing the huge page pool size via nr_hugepages, any existing surplus
>  pages will first be promoted to persistent huge pages.  Then, additional
>  huge pages will be allocated, if necessary and if possible, to fulfill
> -the new huge page pool size.
> +the new persistent huge page pool size.
>  
>  The administrator may shrink the pool of preallocated huge pages for
>  the default huge page size by setting the nr_hugepages sysctl to a
>  smaller value.  The kernel will attempt to balance the freeing of huge pages
> -across all on-line nodes.  Any free huge pages on the selected nodes will
> -be freed back to the buddy allocator.
> -
> -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> -than the number of huge pages in use will convert the balance to surplus
> -huge pages even if it would exceed the overcommit value.  As long as
> -this condition holds, however, no more surplus huge pages will be
> -allowed on the system until one of the two sysctls are increased
> -sufficiently, or the surplus huge pages go out of use and are freed.
> +across all nodes in the memory policy of the task modifying nr_hugepages.
> +Any free huge pages on the selected nodes will be freed back to the kernel's
> +normal page pool.
> +
> +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> +it becomes less than the number of huge pages in use will convert the balance
> +of the in-use huge pages to surplus huge pages.  This will occur even if

Is that "surplus allocated huge pages" (vs. "surplus available huge pages")?

"Surplus" (to me) implies available/unallocated...

Reading further below, I see that "surplus" here means "overcommitted".  Oh well ;)

> +the number of surplus pages it would exceed the overcommit value.  As long as
> +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> +increased sufficiently, or the surplus huge pages go out of use and are freed--
> +no more surplus huge pages will be allowed to be allocated.
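
A small worked example might make the caveat easier to follow.  Say the pool
holds 20 huge pages, 15 of them in use, and nr_hugepages is lowered to 10:
the 5 free pages go back to the normal page pool and the 5 in-use pages
beyond the new pool size are counted as surplus.  The sketch below is only
that arithmetic, assuming no surplus pages and no outstanding reservations
before the shrink; the function name is made up.

```shell
#!/bin/sh
# Sketch of the arithmetic in the caveat.  Assumes no surplus pages and no
# outstanding reservations before the shrink; the function name is made up.
#   $1 = huge pages currently in use
#   $2 = new value written to nr_hugepages
surplus_after_shrink() {
	in_use=$1
	new_nr=$2
	if [ "$new_nr" -lt "$in_use" ]; then
		# In-use pages beyond the new pool size become surplus.
		echo $((in_use - new_nr))
	else
		# The new pool still covers every in-use page.
		echo 0
	fi
}
```
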
>  
>  With support for multiple huge page pools at run-time available, much of
> -the huge page userspace interface has been duplicated in sysfs. The above
> -information applies to the default huge page size which will be
> -controlled by the /proc interfaces for backwards compatibility. The root
> -huge page control directory in sysfs is:
> +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> +The /proc interfaces discussed above have been retained for backwards
> +compatibility. The root huge page control directory in sysfs is:
>  
>  	/sys/kernel/mm/hugepages
>  
>  For each huge page size supported by the running kernel, a subdirectory
> -will exist, of the form
> +will exist, of the form:
>  
>  	hugepages-${size}kB
>  
> @@ -159,6 +162,98 @@ Inside each of these directories, the sa
>  
>  which function as described above for the default huge page-sized case.
>  
> +
> +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> +
> +Whether huge pages are allocated and freed via the /proc interface or
> +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> +or freed are controlled by the NUMA memory policy of the task that modifies
> +the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> +
> +The recommended method to allocate or free huge pages to/from the kernel
> +huge page pool, using the nr_hugepages example above, is:
> +
> +    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.

drop '.'

> +
> +or, more succinctly:
> +
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.

ditto

> +
> +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> +specified in <node-list>, depending on whether nr_hugepages is initially
> +less than or greater than 20, respectively.  No huge pages will be
> +allocated nor freed on any node not included in the specified <node-list>.
> +
> +Any memory policy mode--bind, preferred, local or interleave--may be
> +used.  The effect on persistent huge page allocation will be as follows:

I would just use present tense as much as possible, e.g.,
                                             allocation is as follows:

> +
> +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> +   persistent huge pages will be distributed across the node or nodes
> +   specified in the mempolicy as if "interleave" had been specified.
> +   However, if a node in the policy does not contain sufficient contiguous
> +   memory for a huge page, the allocation will not "fallback" to the nearest
> +   neighbor node with sufficient contiguous memory.  To do this would cause
> +   undesirable imbalance in the distribution of the huge page pool, or
> +   possibly, allocation of persistent huge pages on nodes not allowed by
> +   the task's memory policy.
> +
> +2) One or more nodes may be specified with the bind or interleave policy.
> +   If more than one node is specified with the preferred policy, only the
> +   lowest numeric id will be used.  Local policy will select the node where
> +   the task is running at the time the nodes_allowed mask is constructed.
> +
> +3) For local policy to be deterministic, the task must be bound to a cpu or
> +   cpus in a single node.  Otherwise, the task could be migrated to some

I prefer s/cpu/CPU/ in all of Documentation/ text, but the cat is already out
of the bag on that.

> +   other node at any time after launch and the resulting node will be
> +   indeterminate.  Thus, local policy is not very useful for this purpose.
> +   Any of the other mempolicy modes may be used to specify a single node.
> +
> +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +   whether this policy was set explicitly by the task itself or one of its
> +   ancestors, such as numactl.  This means that if the task is invoked from a
> +   shell with non-default policy, that policy will be used.  One can specify a
> +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> +   interleaving over all nodes in the system or cpuset.
> +
> +5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> +   the resource limits of any cpuset in which the task runs.  Thus, there will
> +   be no way for a task with non-default policy running in a cpuset with a
> +   subset of the system nodes to allocate huge pages outside the cpuset
> +   without first moving to a cpuset that contains all of the desired nodes.
> +
> +6) Hugepages allocated at boot time always use the node_online_map.
> +
> +
> +Per Node Hugepages Attributes
> +
> +A subset of the contents of the root huge page control directory in sysfs,
> +described above, has been replicated under each "node" system device in:
> +
> +	/sys/devices/system/node/node[0-9]*/hugepages/
> +
> +Under this directory, the subdirectory for each supported huge page size
> +contains the following attribute files:
> +
> +	nr_hugepages
> +	free_hugepages
> +	surplus_hugepages
> +
> +The free_' and surplus_' attribute files are read-only.  They return the number
> +of free and surplus [overcommitted] huge pages, respectively, on the parent
> +node.
> +
> +The nr_hugepages attribute will return the total number of huge pages on the
> +specified node.  When this attribute is written, the number of persistent huge
> +pages on the parent node will be adjusted to the specified value, if sufficient
> +resources exist, regardless of the task's mempolicy or cpuset constraints.
> +
> +Note that the number of overcommit and reserve pages remain global quantities,
> +as we don't know until fault time, when the faulting task's mempolicy is applied,
> +from which node the huge page allocation will be attempted.
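
An example of reading these attributes might be useful here.  A rough sketch,
assuming the per-size subdirectories follow the hugepages-${size}kB naming
described earlier; the function name and the "root" argument are my own, and
on a live system root would be /sys/devices/system/node.

```shell
#!/bin/sh
# Sketch: print the per-node nr_hugepages value for one huge page size.
# The function name and the "root" argument are illustrative only.
#   $1 = node directory root (e.g. /sys/devices/system/node)
#   $2 = huge page size in kB (e.g. 2048)
per_node_hugepages() {
	root=$1
	size_kb=$2
	total=0
	for f in "$root"/node[0-9]*/hugepages/hugepages-"${size_kb}"kB/nr_hugepages; do
		# Skip the unexpanded pattern when nothing matches.
		[ -r "$f" ] || continue
		# Reduce the path to its "nodeN" component.
		node=${f#"$root"/}
		node=${node%%/*}
		n=$(cat "$f")
		echo "$node $n"
		total=$((total + n))
	done
	echo "total $total"
}
```

Running "per_node_hugepages /sys/devices/system/node 2048" would list one
line per node followed by the sum.
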
> +
> +
> +Using Huge Pages:
> +
>  If the user applications are going to request huge pages using mmap system
>  call, then it is required that system administrator mount a file system of
>  type hugetlbfs:
> @@ -206,9 +301,11 @@ map_hugetlb.c.
...

> @@ -237,14 +334,8 @@ map_hugetlb.c.
...

> @@ -302,10 +393,12 @@ int main(void)
...

> @@ -317,14 +410,8 @@ int main(void)
...

---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/
