* [PATCH 0/6] hugetlb: V5 constrain allocation/free based on task mempolicy
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

PATCH 0/6 hugetlb: numa control of persistent huge pages alloc/free

Against:  2.6.31-rc7-mmotm-090827-0057

This is V5 of a series of patches to provide control over the location
of the allocation and freeing of persistent huge pages on a NUMA
platform.

This series uses the NUMA mempolicy of the task modifying
"nr_hugepages" to constrain the affected nodes.  This method is
based on Mel Gorman's suggestion to use task mempolicy.  One
benefit of this method is that it does not *require* modification
of hugeadm(8) to use this feature.  One possible downside is that
task mempolicy is limited by cpuset constraints.
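
For illustration, a minimal userspace sketch of the intended usage,
assuming libnuma's set_mempolicy(2) wrapper and the existing
/proc/sys/vm/nr_hugepages interface (run as root, build with -lnuma):

	#include <numaif.h>	/* set_mempolicy(2) wrapper from libnuma */
	#include <stdio.h>

	int main(void)
	{
		unsigned long mask = 1UL << 2;	/* allow node 2 only */

		/* Constrain this task to node 2.  With this series
		 * applied, pages added below should be allocated there. */
		if (set_mempolicy(MPOL_BIND, &mask, 8 * sizeof(mask)))
			return 1;

		FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");
		if (!f)
			return 1;
		fprintf(f, "16\n");	/* grow the pool to 16 pages */
		return fclose(f) ? 1 : 0;
	}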

V4 added a subset of the hugepages sysfs attributes to each per
node system device directory under:

	/sys/devices/system/node/node[0-9]*/hugepages.

The per node attributes allow direct assignment of a huge page
count on a specific node, regardless of the task's mempolicy or
cpuset constraints.
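
As a sketch only, since the exact attribute layout is defined by a
later patch in this series and may differ, setting the count on one
node directly might look like:

	#include <stdio.h>

	int main(void)
	{
		/* Per node attribute path assumed for illustration. */
		FILE *f = fopen("/sys/devices/system/node/node2/hugepages/"
				"hugepages-2048kB/nr_hugepages", "w");

		if (!f)
			return 1;
		fprintf(f, "8\n");	/* 8 persistent huge pages on node 2 */
		return fclose(f) ? 1 : 0;
	}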

V5 addresses review comments -- changes described in patch
descriptions.  Should be almost ready for -mm?

Note, I haven't implemented a boot time parameter to constrain the
boot time allocation of huge pages.  This can be added if anyone feels
strongly that it is required.


* [PATCH 1/6] hugetlb:  rework hstate_next_node_* functions
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 1/6] hugetlb:  rework hstate_next_node* functions

Against:  2.6.31-rc7-mmotm-090827-0057

V2:
+ cleaned up comments: removed some deemed unnecessary,
  added some suggested by review
+ removed check for !current in huge_mpol_nodes_allowed().
+ added 'current->comm' to warning message in huge_mpol_nodes_allowed().
+ added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
  catch out of range node id.
+ added examples to patch description

V3:
+ factored this "cleanup" patch out of V2 patch 2/3
+ moved ahead of patch to add nodes_allowed mask to alloc funcs
  as this patch is somewhat independent from using task mempolicy
  to control huge page allocation and freeing.

Modify the hstate_next_node* functions so that they can be called
to obtain the "start_nid".  Whereas prior to this patch we called
hstate_next_node_to_{alloc|free}() unconditionally, whether or not
we successfully allocated/freed a huge page on the node, we now
call these functions only on failure to alloc/free, in order to
advance to the next allowed node.

Factor out the next_node_allowed() function to handle wrap at end
of node_online_map.  In this version, the allowed nodes include all 
of the online nodes.
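
Condensed from the diff below, the resulting caller pattern in
alloc_fresh_huge_page() becomes:

	start_nid = hstate_next_node_to_alloc(h); /* returns nid, advances next */
	next_nid = start_nid;
	do {
		page = alloc_fresh_huge_page_node(h, next_nid);
		if (page) {
			ret = 1;
			break;	/* success: do not advance again */
		}
		/* advance to the next allowed node only on failure */
		next_nid = hstate_next_node_to_alloc(h);
	} while (next_nid != start_nid);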

Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   70 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 25 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-27 13:12:12.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:21.000000000 -0400
@@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
+ * common helper function for hstate_next_node_to_{alloc|free}.
+ * return next node in node_online_map, wrapping at end.
+ */
+static int next_node_allowed(int nid)
+{
+	nid = next_node(nid, node_online_map);
+	if (nid == MAX_NUMNODES)
+		nid = first_node(node_online_map);
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+
+	return nid;
+}
+
+/*
  * Use a helper variable to find the next node and then
  * copy it back to next_nid_to_alloc afterwards:
  * otherwise there's a window in which a racer might
@@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag
  */
 static int hstate_next_node_to_alloc(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_alloc;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_alloc = next_nid;
-	return next_nid;
+	return nid;
 }
 
 static int alloc_fresh_huge_page(struct hstate *h)
@@ -649,15 +663,17 @@ static int alloc_fresh_huge_page(struct
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_alloc;
+	start_nid = hstate_next_node_to_alloc(h);
 	next_nid = start_nid;
 
 	do {
 		page = alloc_fresh_huge_page_node(h, next_nid);
-		if (page)
+		if (page) {
 			ret = 1;
+			break;
+		}
 		next_nid = hstate_next_node_to_alloc(h);
-	} while (!page && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -668,17 +684,19 @@ static int alloc_fresh_huge_page(struct
 }
 
 /*
- * helper for free_pool_huge_page() - find next node
- * from which to free a huge page
+ * helper for free_pool_huge_page() - return the next node
+ * from which to free a huge page.  Advance the next node id
+ * whether or not we find a free huge page to free so that the
+ * next attempt to free addresses the next node.
  */
 static int hstate_next_node_to_free(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_free, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_free;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_free = next_nid;
-	return next_nid;
+	return nid;
 }
 
 /*
@@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_free;
+	start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
@@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
 			}
 			update_and_free_page(h, page);
 			ret = 1;
+			break;
 		}
 		next_nid = hstate_next_node_to_free(h);
-	} while (!ret && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	return ret;
 }
@@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(h->next_nid_to_alloc),
+				NODE_DATA(hstate_next_node_to_alloc(h)),
 				huge_page_size(h), huge_page_size(h), 0);
 
-		hstate_next_node_to_alloc(h);
 		if (addr) {
 			/*
 			 * Use the beginning of the huge page to store the
@@ -1167,29 +1185,31 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = h->next_nid_to_alloc;
+		start_nid = hstate_next_node_to_alloc(h);
 	else
-		start_nid = h->next_nid_to_free;
+		start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
 		int nid = next_nid;
 		if (delta < 0)  {
-			next_nid = hstate_next_node_to_alloc(h);
 			/*
 			 * To shrink on this node, there must be a surplus page
 			 */
-			if (!h->surplus_huge_pages_node[nid])
+			if (!h->surplus_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_alloc(h);
 				continue;
+			}
 		}
 		if (delta > 0) {
-			next_nid = hstate_next_node_to_free(h);
 			/*
 			 * Surplus cannot exceed the total number of pages
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
+						h->nr_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_free(h);
 				continue;
+			}
 		}
 
 		h->surplus_huge_pages += delta;


* [PATCH 2/6] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 2/6] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns

Against:  2.6.31-rc7-mmotm-090827-0057

V3:
+ moved this patch to after the "rework" of hstate_next_node_to_...
  functions as this patch is more specific to using task mempolicy
  to control huge page allocation and freeing.

V5:
+ removed now unneeded 'nextnid' from hstate_next_node_to_{alloc|free}
  and updated the stale comments.

In preparation for constraining huge page allocation and freeing by the
controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
to the allocate, free and surplus adjustment functions.  For now, pass
NULL to indicate default behavior--i.e., use node_online_map.  A
subsequent patch will derive a non-default mask from the controlling
task's numa mempolicy.

Note that this method of updating the global hstate nr_hugepages under
the constraint of a nodemask simplifies keeping the global state 
consistent--especially the number of persistent and surplus pages
relative to reservations and overcommit limits.  There are undoubtedly
other ways to do this, but this works for both interfaces:  mempolicy
and per node attributes.
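
The defaulting convention, condensed from the diff below: a NULL
'nodes_allowed' selects node_online_map, and the traversal wraps at
the end of the allowed mask:

	static int hstate_next_node_to_alloc(struct hstate *h,
						nodemask_t *nodes_allowed)
	{
		int nid;

		if (!nodes_allowed)
			nodes_allowed = &node_online_map; /* default */

		nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
		h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

		return nid;
	}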

Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |  121 +++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 72 insertions(+), 49 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:21.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:26.000000000 -0400
@@ -622,48 +622,57 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
- * common helper function for hstate_next_node_to_{alloc|free}.
- * return next node in node_online_map, wrapping at end.
+ * common helper functions for hstate_next_node_to_{alloc|free}.
+ * We may have allocated or freed a huge page based on a different
+ * nodes_allowed previously, so h->next_nid_to_{alloc|free} might
+ * be outside of *nodes_allowed.  Ensure that we use an allowed
+ * node for alloc or free.
  */
-static int next_node_allowed(int nid)
+static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
 {
-	nid = next_node(nid, node_online_map);
+	nid = next_node(nid, *nodes_allowed);
 	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+		nid = first_node(*nodes_allowed);
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 
 	return nid;
 }
 
+static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	if (!node_isset(nid, *nodes_allowed))
+		nid = next_node_allowed(nid, nodes_allowed);
+	return nid;
+}
+
 /*
- * Use a helper variable to find the next node and then
- * copy it back to next_nid_to_alloc afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do.  Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
+ * returns the previously saved node ["this node"] from which to
+ * allocate a persistent huge page for the pool and advance the
+ * next node from which to allocate, handling wrap at end of node
+ * mask.  'nodes_allowed' defaults to node_online_map.
  */
-static int hstate_next_node_to_alloc(struct hstate *h)
+static int hstate_next_node_to_alloc(struct hstate *h,
+					nodemask_t *nodes_allowed)
 {
-	int nid, next_nid;
+	int nid;
+
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
+	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
+	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
 
-	nid = h->next_nid_to_alloc;
-	next_nid = next_node_allowed(nid);
-	h->next_nid_to_alloc = next_nid;
 	return nid;
 }
 
-static int alloc_fresh_huge_page(struct hstate *h)
+static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 {
 	struct page *page;
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hstate_next_node_to_alloc(h);
+	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -672,7 +681,7 @@ static int alloc_fresh_huge_page(struct
 			ret = 1;
 			break;
 		}
-		next_nid = hstate_next_node_to_alloc(h);
+		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	} while (next_nid != start_nid);
 
 	if (ret)
@@ -684,18 +693,21 @@ static int alloc_fresh_huge_page(struct
 }
 
 /*
- * helper for free_pool_huge_page() - return the next node
- * from which to free a huge page.  Advance the next node id
- * whether or not we find a free huge page to free so that the
- * next attempt to free addresses the next node.
+ * helper for free_pool_huge_page() - return the previously saved
+ * node ["this node"] from which to free a huge page.  Advance the
+ * next node id whether or not we find a free huge page to free so
+ * that the next attempt to free addresses the next node.
  */
-static int hstate_next_node_to_free(struct hstate *h)
+static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 {
-	int nid, next_nid;
+	int nid;
+
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
+	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
+	h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
 
-	nid = h->next_nid_to_free;
-	next_nid = next_node_allowed(nid);
-	h->next_nid_to_free = next_nid;
 	return nid;
 }
 
@@ -705,13 +717,14 @@ static int hstate_next_node_to_free(stru
  * balanced over allowed nodes.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, bool acct_surplus)
+static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+							 bool acct_surplus)
 {
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hstate_next_node_to_free(h);
+	start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -735,7 +748,7 @@ static int free_pool_huge_page(struct hs
 			ret = 1;
 			break;
 		}
-		next_nid = hstate_next_node_to_free(h);
+		next_nid = hstate_next_node_to_free(h, nodes_allowed);
 	} while (next_nid != start_nid);
 
 	return ret;
@@ -937,7 +950,7 @@ static void return_unused_surplus_pages(
 	 * on-line nodes for us and will handle the hstate accounting.
 	 */
 	while (nr_pages--) {
-		if (!free_pool_huge_page(h, 1))
+		if (!free_pool_huge_page(h, NULL, 1))
 			break;
 	}
 }
@@ -1047,7 +1060,7 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(hstate_next_node_to_alloc(h)),
+				NODE_DATA(hstate_next_node_to_alloc(h, NULL)),
 				huge_page_size(h), huge_page_size(h), 0);
 
 		if (addr) {
@@ -1102,7 +1115,7 @@ static void __init hugetlb_hstate_alloc_
 		if (h->order >= MAX_ORDER) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h))
+		} else if (!alloc_fresh_huge_page(h, NULL))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -1144,16 +1157,22 @@ static void __init report_hugepages(void
 }
 
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(struct hstate *h, unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count,
+						nodemask_t *nodes_allowed)
 {
 	int i;
 
 	if (h->order >= MAX_ORDER)
 		return;
 
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
+		if (!node_isset(i, *nodes_allowed))
+			continue;
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
 				return;
@@ -1167,7 +1186,8 @@ static void try_to_free_low(struct hstat
 	}
 }
 #else
-static inline void try_to_free_low(struct hstate *h, unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count,
+						nodemask_t *nodes_allowed)
 {
 }
 #endif
@@ -1177,7 +1197,8 @@ static inline void try_to_free_low(struc
  * balanced by operating on them in a round-robin fashion.
  * Returns 1 if an adjustment was made.
  */
-static int adjust_pool_surplus(struct hstate *h, int delta)
+static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
+				int delta)
 {
 	int start_nid, next_nid;
 	int ret = 0;
@@ -1185,9 +1206,9 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = hstate_next_node_to_alloc(h);
+		start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	else
-		start_nid = hstate_next_node_to_free(h);
+		start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -1197,7 +1218,8 @@ static int adjust_pool_surplus(struct hs
 			 * To shrink on this node, there must be a surplus page
 			 */
 			if (!h->surplus_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_alloc(h);
+				next_nid = hstate_next_node_to_alloc(h,
+								nodes_allowed);
 				continue;
 			}
 		}
@@ -1207,7 +1229,8 @@ static int adjust_pool_surplus(struct hs
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
 						h->nr_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_free(h);
+				next_nid = hstate_next_node_to_free(h,
+								nodes_allowed);
 				continue;
 			}
 		}
@@ -1242,7 +1265,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, -1))
+		if (!adjust_pool_surplus(h, NULL, -1))
 			break;
 	}
 
@@ -1253,7 +1276,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h);
+		ret = alloc_fresh_huge_page(h, NULL);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1277,13 +1300,13 @@ static unsigned long set_max_huge_pages(
 	 */
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count);
+	try_to_free_low(h, min_count, NULL);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, 0))
+		if (!free_pool_huge_page(h, NULL, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, 1))
+		if (!adjust_pool_surplus(h, NULL, 1))
 			break;
 	}
 out:


* [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy

Against: 2.6.31-rc7-mmotm-090827-0057

V2:
+ cleaned up comments, removed some deemed unnecessary,
  add some suggested by review
+ removed check for !current in huge_mpol_nodes_allowed().
+ added 'current->comm' to warning message in huge_mpol_nodes_allowed().
+ added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
  catch out of range node id.
+ add examples to patch description

V3: Factored this patch from V2 patch 2/3

V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()

V5: remove internal '\n' from printk in huge_mpol_nodes_allowed()

This patch derives a "nodes_allowed" node mask from the numa
mempolicy of the task modifying the number of persistent huge
pages to control the allocation, freeing and adjusting of surplus
huge pages.  This mask is derived as follows [see the condensed
sketch after this list]:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced. 
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".
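
Condensed from huge_mpol_nodes_allowed() in the diff below, the
mode to mask mapping is:

	switch (mempolicy->mode) {
	case MPOL_PREFERRED:
		if (mempolicy->flags & MPOL_F_LOCAL)	/* explicit "local" */
			nid = numa_node_id();
		else
			nid = mempolicy->v.preferred_node;
		node_set(nid, *nodes_allowed);
		break;

	case MPOL_BIND:
		/* Fall through: both use the policy's nodemask */
	case MPOL_INTERLEAVE:
		*nodes_allowed = mempolicy->v.nodes;
		break;
	}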

Notes:

1) This patch introduces a subtle change in behavior:  huge page
   allocation and freeing will be constrained by any mempolicy
   that the task adjusting the huge page pool inherits from its
   parent.  This policy could come from a distant ancestor.  The
   administrator adjusting the huge page pool without explicitly
   specifying a mempolicy via numactl might be surprised by this.
   Additionally, any mempolicy specified by numactl will be
   constrained by the cpuset in which numactl is invoked.
   Using sysfs per node hugepages attributes to adjust the per
   node persistent huge pages count [subsequent patch] ignores
   mempolicy and cpuset constraints.

2) Hugepages allocated at boot time use the node_online_map.
   An additional patch could implement a temporary boot time
   huge pages nodes_allowed command line parameter.

3) Using mempolicy to control persistent huge page allocation
   and freeing requires no change to hugeadm when invoking
   it via numactl, as shown in the examples below.  However,
   hugeadm could be enhanced to take the allowed nodes as an
   argument and set its task mempolicy itself.  This would allow
   it to detect and warn about any non-default mempolicy that it
   inherited from its parent, thus alleviating the issue described
   in Note 1 above.

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	hugeadm --pool-pages-min=2048Kb:32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the 
condition above, with 8 huge pages per node:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    3 ++
 mm/hugetlb.c              |   14 ++++++----
 mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 5 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/mempolicy.c	2009-08-28 09:21:20.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c	2009-08-28 09:21:28.000000000 -0400
@@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
 	}
 	return zl;
 }
+
+/*
+ * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
+ *
+ * Returns a [pointer to a] nodelist based on the current task's mempolicy
+ * to constrain the allocation and freeing of persistent huge pages.
+ * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
+ * 'bind' policy in this context.  An attempt to allocate a persistent huge
+ * page will never "fallback" to another node inside the buddy system
+ * allocator.
+ *
+ * If the task's mempolicy is "default" [NULL], just return NULL for
+ * default behavior.  Otherwise, extract the policy nodemask for 'bind'
+ * or 'interleave' policy or construct a nodemask for 'preferred' or
+ * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
+ *
+ * N.B., it is the caller's responsibility to free a returned nodemask.
+ */
+nodemask_t *huge_mpol_nodes_allowed(void)
+{
+	nodemask_t *nodes_allowed = NULL;
+	struct mempolicy *mempolicy;
+	int nid;
+
+	if (!current->mempolicy)
+		return NULL;
+
+	mpol_get(current->mempolicy);
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.  Falling back to default.\n",
+			current->comm);
+		goto out;
+	}
+	nodes_clear(*nodes_allowed);
+
+	mempolicy = current->mempolicy;
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			nid = numa_node_id();
+		else
+			nid = mempolicy->v.preferred_node;
+		node_set(nid, *nodes_allowed);
+		break;
+
+	case MPOL_BIND:
+		/* Fall through */
+	case MPOL_INTERLEAVE:
+		*nodes_allowed =  mempolicy->v.nodes;
+		break;
+
+	default:
+		BUG();
+	}
+
+out:
+	mpol_put(current->mempolicy);
+	return nodes_allowed;
+}
 #endif
 
 /* Allocate a page in interleaved policy.
Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/mempolicy.h
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/mempolicy.h	2009-08-28 09:21:20.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/mempolicy.h	2009-08-28 09:21:28.000000000 -0400
@@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
+extern nodemask_t *huge_mpol_nodes_allowed(void);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
 	return node_zonelist(0, gfp_flags);
 }
 
+static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:26.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
@@ -1248,10 +1248,13 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
+	nodemask_t *nodes_allowed;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
+	nodes_allowed = huge_mpol_nodes_allowed();
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1265,7 +1268,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, -1))
+		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
 	}
 
@@ -1276,7 +1279,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h, NULL);
+		ret = alloc_fresh_huge_page(h, nodes_allowed);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1300,18 +1303,19 @@ static unsigned long set_max_huge_pages(
 	 */
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count, NULL);
+	try_to_free_low(h, min_count, nodes_allowed);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, NULL, 0))
+		if (!free_pool_huge_page(h, nodes_allowed, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, 1))
+		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
 	}
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	kfree(nodes_allowed);
 	return ret;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
@ 2009-08-28 16:03   ` Lee Schermerhorn
  0 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy

Against: 2.6.31-rc7-mmotm-090827-0057

V2:
+ cleaned up comments, removed some deemed unnecessary,
  add some suggested by review
+ removed check for !current in huge_mpol_nodes_allowed().
+ added 'current->comm' to warning message in huge_mpol_nodes_allowed().
+ added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
  catch out of range node id.
+ add examples to patch description

V3: Factored this patch from V2 patch 2/3

V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()

V5: remove internal '\n' from printk in huge_mpol_nodes_allowed()

This patch derives a "nodes_allowed" node mask from the numa
mempolicy of the task modifying the number of persistent huge
pages to control the allocation, freeing and adjusting of surplus
huge pages.  This mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced. 
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

Notes:

1) This patch introduces a subtle change in behavior:  huge page
   allocation and freeing will be constrained by any mempolicy
   that the task adjusting the huge page pool inherits from its
   parent.  This policy could come from a distant ancestor.  The
   adminstrator adjusting the huge page pool without explicitly
   specifying a mempolicy via numactl might be surprised by this.
   Additionaly, any mempolicy specified by numactl will be
   constrained by the cpuset in which numactl is invoked.
   Using sysfs per node hugepages attributes to adjust the per
   node persistent huge pages count [subsequent patch] ignores
   mempolicy and cpuset constraints.

2) Hugepages allocated at boot time use the node_online_map.
   An additional patch could implement a temporary boot time
   huge pages nodes_allowed command line parameter.

3) Using mempolicy to control persistent huge page allocation
   and freeing requires no change to hugeadm when invoking
   it via numactl, as shown in the examples below.  However,
   hugeadm could be enhanced to take the allowed nodes as an
   argument and set its task mempolicy itself.  This would allow
   it to detect and warn about any non-default mempolicy that it
   inherited from its parent, thus alleviating the issue described
   in Note 1 above.

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	hugeadm --pool-pages-min=2048Kb:32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the 
condition above, with 8 huge pages per node:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    3 ++
 mm/hugetlb.c              |   14 ++++++----
 mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 5 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/mempolicy.c	2009-08-28 09:21:20.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c	2009-08-28 09:21:28.000000000 -0400
@@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
 	}
 	return zl;
 }
+
+/*
+ * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
+ *
+ * Returns a [pointer to a] nodelist based on the current task's mempolicy
+ * to constraing the allocation and freeing of persistent huge pages
+ * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
+ * 'bind' policy in this context.  An attempt to allocate a persistent huge
+ * page will never "fallback" to another node inside the buddy system
+ * allocator.
+ *
+ * If the task's mempolicy is "default" [NULL], just return NULL for
+ * default behavior.  Otherwise, extract the policy nodemask for 'bind'
+ * or 'interleave' policy or construct a nodemask for 'preferred' or
+ * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
+ *
+ * N.B., it is the caller's responsibility to free a returned nodemask.
+ */
+nodemask_t *huge_mpol_nodes_allowed(void)
+{
+	nodemask_t *nodes_allowed = NULL;
+	struct mempolicy *mempolicy;
+	int nid;
+
+	if (!current->mempolicy)
+		return NULL;
+
+	mpol_get(current->mempolicy);
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.  Falling back to default.\n",
+			current->comm);
+		goto out;
+	}
+	nodes_clear(*nodes_allowed);
+
+	mempolicy = current->mempolicy;
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			nid = numa_node_id();
+		else
+			nid = mempolicy->v.preferred_node;
+		node_set(nid, *nodes_allowed);
+		break;
+
+	case MPOL_BIND:
+		/* Fall through */
+	case MPOL_INTERLEAVE:
+		*nodes_allowed =  mempolicy->v.nodes;
+		break;
+
+	default:
+		BUG();
+	}
+
+out:
+	mpol_put(current->mempolicy);
+	return nodes_allowed;
+}
 #endif
 
 /* Allocate a page in interleaved policy.
Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/mempolicy.h
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/mempolicy.h	2009-08-28 09:21:20.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/mempolicy.h	2009-08-28 09:21:28.000000000 -0400
@@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
+extern nodemask_t *huge_mpol_nodes_allowed(void);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
 	return node_zonelist(0, gfp_flags);
 }
 
+static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:26.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
@@ -1248,10 +1248,13 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
+	nodemask_t *nodes_allowed;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
+	nodes_allowed = huge_mpol_nodes_allowed();
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1265,7 +1268,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, -1))
+		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
 	}
 
@@ -1276,7 +1279,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h, NULL);
+		ret = alloc_fresh_huge_page(h, nodes_allowed);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1300,18 +1303,19 @@ static unsigned long set_max_huge_pages(
 	 */
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count, NULL);
+	try_to_free_low(h, min_count, nodes_allowed);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, NULL, 0))
+		if (!free_pool_huge_page(h, nodes_allowed, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, 1))
+		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
 	}
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	kfree(nodes_allowed);
 	return ret;
 }
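
For reference, the nodes_allowed derivation implemented above, in
summary form [a digest of huge_mpol_nodes_allowed(); kmalloc() failure
also yields NULL, with a warning]:

	task mempolicy		nodes_allowed returned
	--------------		----------------------
	default [NULL]		NULL => all on-line nodes
	preferred / local	single [preferred or local] node
	bind / interleave	copy of the policy's nodemask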
 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 4/6] hugetlb:  introduce alloc_nodemask_of_node
  2009-08-28 16:03 ` Lee Schermerhorn
@ 2009-08-28 16:03   ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 4/6] - hugetlb:  introduce alloc_nodemask_of_node()

Against:  2.6.31-rc7-mmotm-090827-0057

New in V5 of series

Introduce a nodemask macro to allocate a nodemask and
initialize it to contain a single node, using the macro
init_nodemask_of_nodes() factored out of the nodemask_of_node()
macro.

alloc_nodemask_of_node() is coded as a macro to avoid header
dependency hell.

This will be used to construct the huge pages "nodes_allowed"
nodemask for a single node when a persistent huge page
pool page count is modified via a per node sysfs attribute.
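
As a usage sketch [caller side, illustrative only; this mirrors the
use in patch 5/6, with error handling abbreviated]:

	nodemask_t *nodes_allowed = alloc_nodemask_of_node(nid);
	if (!nodes_allowed)
		printk(KERN_WARNING
			"falling back to default [all on-line nodes]\n");
	/* ... constrain huge page alloc/free to *nodes_allowed ... */
	kfree(nodes_allowed);	/* kfree(NULL) is a legal no-op */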

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/nodemask.h |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/nodemask.h	2009-08-28 09:21:19.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h	2009-08-28 09:21:29.000000000 -0400
@@ -245,18 +245,34 @@ static inline int __next_node(int n, con
 	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
 }
 
+#define init_nodemask_of_nodes(mask, node)				\
+	do { nodes_clear(*(mask));					\
+	     node_set((node), *(mask)); } while (0)
+
 #define nodemask_of_node(node)						\
 ({									\
 	typeof(_unused_nodemask_arg_) m;				\
 	if (sizeof(m) == sizeof(unsigned long)) {			\
 		m.bits[0] = 1UL<<(node);				\
 	} else {							\
-		nodes_clear(m);						\
-		node_set((node), m);					\
+		init_nodemask_of_nodes(&m, (node));			\
 	}								\
 	m;								\
 })
 
+/*
+ * returns pointer to kmalloc()'d nodemask initialized to contain the
+ * specified node.  Caller must free with kfree().
+ */
+#define alloc_nodemask_of_node(node)					\
+({									\
+	typeof(_unused_nodemask_arg_) *nmp;				\
+	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
+	if (nmp)							\
+		init_nodemask_of_nodes(nmp, (node));			\
+	nmp;								\
+})
+
 #define first_unset_node(mask) __first_unset_node(&(mask))
 static inline int __first_unset_node(const nodemask_t *maskp)
 {


^ permalink raw reply	[flat|nested] 81+ messages in thread


* [PATCH 5/6] hugetlb:  add per node hstate attributes
  2009-08-28 16:03 ` Lee Schermerhorn
@ 2009-08-28 16:03   ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 5/6] hugetlb:  register per node hugepages attributes

Against: 2.6.31-rc7-mmotm-090827-0057

V2:  remove dependency on kobject private bitfield.  Search
     global hstates then all per node hstates for kobject
     match in attribute show/store functions.

V3:  rebase atop the mempolicy-based hugepage alloc/free;
     use custom "nodes_allowed" to restrict alloc/free to
     a specific node via per node attributes.  Per node
     attribute overrides mempolicy.  I.e., mempolicy only
     applies to global attributes.

V5:  Fix issues raised by Mel Gorman:
     + add !NUMA versions of hugetlb_[un]register_node()
     + rename 'hi' to 'i' in kobj_to_node_hstate()
     + rename (count, input) to (len, count) in nr_hugepages_store()
     + moved per node hugepages_kobj and hstate_kobjs[] from the
       struct node [sysdev] to hugetlb.c private arrays.
     + changed registration mechanism so that hugetlbfs [a module]
       registers its attribute registration callbacks with the node
       driver, eliminating the dependency between the node driver
       and hugetlbfs.  From its init func, hugetlbfs will register
       all on-line nodes' hugepage sysfs attributes along with
       hugetlbfs' attribute register/unregister functions.  The
       node driver will use these functions to [un]register nodes
       with hugetlbfs on node hot-plug; see the sketch below.
     + replaced hugetlb.c private "nodes_allowed_from_node()" with
       [new] generic "alloc_nodemask_of_node()".

V5a: + fix !NUMA register_hugetlbfs_with_node():  don't use
       keyword 'do' as parameter name!
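
A minimal sketch of the resulting registration protocol [function
names as in the diff below; flow abbreviated, not the complete
implementation]:

	/* mm/hugetlb.c -- at hugetlbfs init: add attributes for all
	 * on-line nodes, then hand the callbacks to the node driver
	 * so hot-plugged nodes get the same treatment.
	 */
	static void hugetlb_register_all_nodes(void)
	{
		int nid;

		for (nid = 0; nid < nr_node_ids; nid++)
			hugetlb_register_node(&node_devices[nid]);

		register_hugetlbfs_with_node(hugetlb_register_node,
					     hugetlb_unregister_node);
	}

	/* drivers/base/node.c -- on node hot-add: a no-op until
	 * hugetlbfs has registered its callbacks.
	 */
	static inline void hugetlb_register_node(struct node *node)
	{
		if (__hugetlb_register_node)
			__hugetlb_register_node(node);
	}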

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_huge_pages    - r/o
	surplus_huge_pages - r/o

The patch attempts to re-use/share as much of the existing
global hstate attribute initialization and handling, and the
"nodes_allowed" constraint processing as possible.
Calling set_max_huge_pages() with no node indicates a change to
global hstate parameters.  In this case, any non-default task
mempolicy will be used to generate the nodes_allowed mask.  A
valid node id indicates an update to that node's hstate 
parameters, and the count argument specifies the target count
for the specified node.  From this info, we compute the target
global count for the hstate and construct a nodes_allowed nodemask
containing only the specified node.
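
For example (hypothetical counts): if the hstate currently holds 24
huge pages globally, 16 of them on node 2, then writing 8 to node 2's
nr_hugepages yields a global target of 8 + (24 - 16) = 16, with the
allocation/freeing restricted to node 2 by the nodes_allowed mask.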

Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
	numactl -m 2 hugeadm --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
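
The per node attribute reads back the new value directly [path as
added by this patch; the remaining 8 pages are all on node 2]:

(me):cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
8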

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c  |   33 ++++++
 include/linux/node.h |    8 +
 include/linux/numa.h |    2 
 mm/hugetlb.c         |  245 ++++++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 258 insertions(+), 30 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/drivers/base/node.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/drivers/base/node.c	2009-08-28 09:21:17.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/drivers/base/node.c	2009-08-28 09:21:31.000000000 -0400
@@ -177,6 +177,37 @@ static ssize_t node_read_distance(struct
 }
 static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+/*
+ * hugetlbfs per node attributes registration interface:
+ * When/if hugetlb[fs] subsystem initializes [sometime after this module],
+ * it will register its per node attributes for all nodes on-line at that
+ * point.  It will also call register_hugetlbfs_with_node(), below, to
+ * register its attribute registration functions with this node driver.
+ * Once these hooks have been initialized, the node driver will call into
+ * the hugetlb module to [un]register attributes for hot-plugged nodes.
+ */
+NODE_REGISTRATION_FUNC __hugetlb_register_node;
+NODE_REGISTRATION_FUNC __hugetlb_unregister_node;
+
+static inline void hugetlb_register_node(struct node *node)
+{
+	if (__hugetlb_register_node)
+		__hugetlb_register_node(node);
+}
+
+static inline void hugetlb_unregister_node(struct node *node)
+{
+	if (__hugetlb_unregister_node)
+		__hugetlb_unregister_node(node);
+}
+
+void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
+                                  NODE_REGISTRATION_FUNC unregister)
+{
+	__hugetlb_register_node   = doregister;
+	__hugetlb_unregister_node = unregister;
+}
+
 
 /*
  * register_node - Setup a sysfs device for a node.
@@ -200,6 +231,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +252,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:31.000000000 -0400
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/node.h>
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs
 }
 
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+								int nid)
 {
 	unsigned long min_count, ret;
 	nodemask_t *nodes_allowed;
@@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	nodes_allowed = huge_mpol_nodes_allowed();
+	if (nid == NO_NODEID_SPECIFIED)
+		nodes_allowed = huge_mpol_nodes_allowed();
+	else {
+		/*
+		 * incoming 'count' is for node 'nid' only, so
+		 * adjust count to global, but restrict alloc/free
+		 * to the specified node.
+		 */
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		nodes_allowed = alloc_nodemask_of_node(nid);
+		if (!nodes_allowed)
+			printk(KERN_WARNING "%s unable to allocate allowed "
+			       "nodes mask for huge page allocation/free.  "
+			       "Falling back to default.\n", current->comm);
+	}
 
 	/*
 	 * Increase the pool size
@@ -1329,51 +1345,71 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
-static struct hstate *kobj_to_hstate(struct kobject *kobj)
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
 {
 	int i;
+
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
+		if (hstate_kobjs[i] == kobj) {
+			if (nidp)
+				*nidp = NO_NODEID_SPECIFIED;
 			return &hstates[i];
-	BUG();
-	return NULL;
+		}
+
+	return kobj_to_node_hstate(kobj, nidp);
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	struct hstate *h;
+	unsigned long nr_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NO_NODEID_SPECIFIED)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
-		struct kobj_attribute *attr, const char *buf, size_t count)
+		struct kobj_attribute *attr, const char *buf, size_t len)
 {
+	unsigned long count;
+	struct hstate *h;
+	int nid;
 	int err;
-	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
 
-	err = strict_strtoul(buf, 10, &input);
+	err = strict_strtoul(buf, 10, &count);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h = kobj_to_hstate(kobj, &nid);
+	h->max_huge_pages = set_max_huge_pages(h, count, nid);
 
-	return count;
+	return len;
 }
 HSTATE_ATTR(nr_hugepages);
 
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
 	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
@@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	struct hstate *h;
+	unsigned long free_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NO_NODEID_SPECIFIED)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
 static ssize_t resv_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 	return sprintf(buf, "%lu\n", h->resv_huge_pages);
 }
 HSTATE_ATTR_RO(resv_hugepages);
@@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages);
 static ssize_t surplus_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	struct hstate *h;
+	unsigned long surplus_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NO_NODEID_SPECIFIED)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+
+struct node_hstate {
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
+};
+struct node_hstate node_hstates[MAX_NUMNODES];
+
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node_hstate *nhs = &node_hstates[nid];
+		int i;
+		for (i = 0; i < HUGE_MAX_HSTATE; i++)
+			if (nhs->hstate_kobjs[i] == kobj) {
+				if (nidp)
+					*nidp = nid;
+				return &hstates[i];
+			}
+	}
+
+	BUG();
+	return NULL;
+}
+
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
+
+	if (!nhs->hugepages_kobj)
+		return;
+
+	for_each_hstate(h)
+		if (nhs->hstate_kobjs[h - hstates]) {
+			kobject_put(nhs->hstate_kobjs[h - hstates]);
+			nhs->hstate_kobjs[h - hstates] = NULL;
+		}
+
+	kobject_put(nhs->hugepages_kobj);
+	nhs->hugepages_kobj = NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+
+	register_hugetlbfs_with_node(NULL, NULL);
+}
+
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
+	int err;
+
+	if (nhs->hugepages_kobj)
+		return;		/* already allocated */
+
+	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!nhs->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
+						nhs->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err) {
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+			hugetlb_unregister_node(node);
+			break;
+		}
+	}
+}
+
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid)
+			hugetlb_register_node(node);
+	}
+
+	register_hugetlbfs_with_node(hugetlb_register_node,
+                                     hugetlb_unregister_node);
+}
+#else	/* !CONFIG_NUMA */
+
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	BUG();
+	if (nidp)
+		*nidp = -1;
+	return NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void) { }
+
+static void hugetlb_register_all_nodes(void) { }
+
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp,
+		                                       NO_NODEID_SPECIFIED);
 
 	return 0;
 }
Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/numa.h	2009-08-28 09:21:17.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h	2009-08-28 09:21:31.000000000 -0400
@@ -10,4 +10,6 @@
 
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
+#define NO_NODEID_SPECIFIED	(-1)
+
 #endif /* _LINUX_NUMA_H */
Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/node.h
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/node.h	2009-08-28 09:21:17.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/node.h	2009-08-28 09:21:31.000000000 -0400
@@ -28,6 +28,7 @@ struct node {
 
 struct memory_block;
 extern struct node node_devices[];
+typedef  void (*NODE_REGISTRATION_FUNC)(struct node *);
 
 extern int register_node(struct node *, int, struct node *);
 extern void unregister_node(struct node *node);
@@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						int nid);
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
+                                         NODE_REGISTRATION_FUNC unregister);
 #else
 static inline int register_one_node(int nid)
 {
@@ -65,6 +68,11 @@ static inline int unregister_mem_sect_un
 {
 	return 0;
 }
+
+static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC reg,
+                                                NODE_REGISTRATION_FUNC unreg)
+{
+}
 #endif
 
 #define to_node(sys_device) container_of(sys_device, struct node, sysdev)


^ permalink raw reply	[flat|nested] 81+ messages in thread


* [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-08-28 16:03 ` Lee Schermerhorn
@ 2009-08-28 16:03   ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-08-28 16:03 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.

Against: 2.6.31-rc7-mmotm-090827-0057

V2:  Add brief description of per node attributes.

This patch updates the kernel huge tlb documentation to describe the
NUMA memory policy based huge page management.  Additionally, the patch
includes a fair amount of rework to improve consistency, eliminate
duplication and set the context for documenting the memory policy
interaction.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
 1 file changed, 172 insertions(+), 85 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:16.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:32.000000000 -0400
@@ -11,23 +11,21 @@ This optimization is more critical now a
 (several GBs) are more readily available.
 
 Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSv shared memory system calls (shmget, shmat).
+system call or standard SYSV shared memory system calls (shmget, shmat).
 
 First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with huge page support should show the number of configured
-huge pages in the system by running the "cat /proc/meminfo" command.
+The /proc/meminfo file provides information about the total number of hugetlb
+pages preallocated in the kernel's huge page pool.  It also displays
+information about the number of free, reserved and surplus huge pages and the
+[default] huge page size.  The huge page size is needed for generating the
+proper alignment and size of the arguments to system calls that map huge page
+regions.
 
-/proc/meminfo also provides information about the total number of hugetlb
-pages configured in the kernel.  It also displays information about the
-number of free hugetlb pages at any time.  It also displays information about
-the configured huge page size - this is needed for generating the proper
-alignment and size of the arguments to the above system calls.
-
-The output of "cat /proc/meminfo" will have lines like:
+The output of "cat /proc/meminfo" will include lines like:
 
 .....
 HugePages_Total: vvv
@@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) huge pages.
-The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of huge pages is
-possible only if there are enough hugetlb pages free that can be transferred
-back to regular memory pool).
-
-Pages that are used as hugetlb pages are reserved inside the kernel and cannot
-be used for other purposes.
-
-Once the kernel with Hugetlb page support is built and running, a user can
-use either the mmap system call or shared memory system calls to start using
-the huge pages.  It is required that the system administrator preallocate
-enough memory for huge page purposes.
-
-The administrator can preallocate huge pages on the kernel boot command line by
-specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
-requested.  This is the most reliable method for preallocating huge pages as
-memory has not yet become fragmented.
+/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
+allocated in the kernel's huge page pool.  These are called "persistent"
+huge pages.  A user with root privileges can dynamically allocate more or
+free some persistent huge pages by increasing or decreasing the value of
+'nr_hugepages'.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes.  Huge pages can not be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages.  See the discussion of
+Using Huge Pages, below.
+
+The administrator can preallocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested.  This is the most reliable method
+of preallocating huge pages, as memory has not yet become fragmented.
 
 Some platforms support multiple huge page sizes.  To preallocate huge pages
 of a specific size, one must preceed the huge pages boot command parameters
@@ -80,19 +77,24 @@ with a huge page size selection paramete
 be specified in bytes with optional scale suffix [kKmMgG].  The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured [default
-size] hugetlb pages in the kernel.  Super user can dynamically request more
-(or free some pre-configured) huge pages.
-
-Use the following command to dynamically allocate/deallocate default sized
-huge pages:
+When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 default sized huge pages in the system.
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
 On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over the all on-line nodes.  These huge pages, allocated when nr_hugepages
-is increased, are called "persistent huge pages".
+over all the nodes specified by the NUMA memory policy of the task that
+modifies nr_hugepages and that contain sufficient available contiguous memory.
+These nodes are called the huge pages "allowed nodes".  The default for the
+huge pages allowed nodes--when the task has default memory policy--is all
+on-line nodes.  See the discussion below of the interaction of task memory
+policy, cpusets and per node attributes with the allocation and freeing of
+persistent huge pages.
 
 The success or failure of huge page allocation depends on the amount of
 physically contiguous memory that is preset in system at the time of the
@@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any.
 
-System administrators may want to put this command in one of the local rc init
-files.  This will enable the kernel to request huge pages early in the boot
-process when the possibility of getting physical contiguous pages is still
-very high.  Administrators can verify the number of huge pages actually
-allocated by checking the sysctl or meminfo.  To check the per node
+System administrators may want to put this command in one of the local rc
+init files.  This will enable the kernel to preallocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high.  Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo.  To check the per node
 distribution of huge pages in a NUMA system, use:
 
 	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
@@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
 /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
 huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
 requested by applications.  Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
-huge pages from the buddy allocator, when the normal pool is exhausted. As
-these surplus huge pages go out of use, they are freed back to the buddy
-allocator.
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
 
-When increasing the huge page pool size via nr_hugepages, any surplus
+When increasing the huge page pool size via nr_hugepages, any existing surplus
 pages will first be promoted to persistent huge pages.  Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
-the new huge page pool size.
+the new persistent huge page pool size.
 
 The administrator may shrink the pool of preallocated huge pages for
 the default huge page size by setting the nr_hugepages sysctl to a
 smaller value.  The kernel will attempt to balance the freeing of huge pages
-across all on-line nodes.  Any free huge pages on the selected nodes will
-be freed back to the buddy allocator.
-
-Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of huge pages in use will convert the balance to surplus
-huge pages even if it would exceed the overcommit value.  As long as
-this condition holds, however, no more surplus huge pages will be
-allowed on the system until one of the two sysctls are increased
-sufficiently, or the surplus huge pages go out of use and are freed.
+across all nodes in the memory policy of the task modifying nr_hugepages.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages.  This will occur even if
+the number of surplus pages would exceed the overcommit value.  As long as
+this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
 
 With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface has been duplicated in sysfs. The above
-information applies to the default huge page size which will be
-controlled by the /proc interfaces for backwards compatibility. The root
-huge page control directory in sysfs is:
+the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
+The /proc interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is:
 
 	/sys/kernel/mm/hugepages
 
 For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form
+will exist, of the form:
 
 	hugepages-${size}kB
 
@@ -159,6 +162,98 @@ Inside each of these directories, the sa
 
 which function as described above for the default huge page-sized case.
 
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
+
+Whether huge pages are allocated and freed via the /proc interface or
+the /sysfs interface, the NUMA nodes from which huge pages are allocated
+or freed are controlled by the NUMA memory policy of the task that modifies
+the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the nr_hugepages example above, is:
+
+    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+or, more succinctly:
+
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+This will allocate or free abs(20 - nr_hugepages) huge pages to or from
+the nodes specified in <node-list>, depending on whether nr_hugepages is
+initially less than or greater than 20, respectively.  No huge pages will
+be allocated or freed on any node not included in the specified <node-list>.
+
+Any memory policy mode--bind, preferred, local or interleave--may be
+used.  The effect on persistent huge page allocation will be as follows:
+
+1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+   persistent huge pages will be distributed across the node or nodes
+   specified in the mempolicy as if "interleave" had been specified.
+   However, if a node in the policy does not contain sufficient contiguous
+   memory for a huge page, the allocation will not "fallback" to the nearest
+   neighbor node with sufficient contiguous memory.  To do this would cause
+   undesirable imbalance in the distribution of the huge page pool, or
+   possibly, allocation of persistent huge pages on nodes not allowed by
+   the task's memory policy.
+
+2) One or more nodes may be specified with the bind or interleave policy.
+   If more than one node is specified with the preferred policy, only the
+   lowest numeric id will be used.  Local policy will select the node where
+   the task is running at the time the nodes_allowed mask is constructed.
+
+3) For local policy to be deterministic, the task must be bound to a cpu or
+   cpus in a single node.  Otherwise, the task could be migrated to some
+   other node at any time after launch and the resulting node will be
+   indeterminate.  Thus, local policy is not very useful for this purpose.
+   Any of the other mempolicy modes may be used to specify a single node.
+
+4) The nodes allowed mask will be derived from any non-default task mempolicy,
+   whether this policy was set explicitly by the task itself or one of its
+   ancestors, such as numactl.  This means that if the task is invoked from a
+   shell with non-default policy, that policy will be used.  One can specify a
+   node list of "all" with numactl --interleave or --membind [-m] to achieve
+   interleaving over all nodes in the system or cpuset.
+
+5) Any task mempolicy specified--e.g., using numactl--will be constrained by
+   the resource limits of any cpuset in which the task runs.  Thus, there will
+   be no way for a task with non-default policy running in a cpuset with a
+   subset of the system nodes to allocate huge pages outside the cpuset
+   without first moving to a cpuset that contains all of the desired nodes.
+
+6) Hugepages allocated at boot time always use the node_online_map.
+
+
+Per Node Hugepages Attributes
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, has been replicated under each "node" system device in:
+
+	/sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files:
+
+	nr_hugepages
+	free_hugepages
+	surplus_hugepages
+
+The free_hugepages and surplus_hugepages attribute files are read-only.
+They return the number of free and surplus [overcommitted] huge pages,
+respectively, on the parent node.
+
+The nr_hugepages attribute will return the total number of huge pages on the
+specified node.  When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
+
+Note that the numbers of overcommit and reserve pages remain global
+quantities, as we don't know until fault time, when the faulting task's
+mempolicy is applied, from which node the allocation will be attempted.
+
+
+Using Huge Pages:
+
 If the user applications are going to request huge pages using mmap system
 call, then it is required that system administrator mount a file system of
 type hugetlbfs:
@@ -206,9 +301,11 @@ map_hugetlb.c.
  * requesting huge pages.
  *
  * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages.  That means the addresses starting with 0x800000... will need
- * to be specified.  Specifying a fixed address is not required on ppc64,
- * i386 or x86_64.
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
  *
  * Note: The default shared memory limit is quite low on many kernels,
  * you may need to increase it via:
@@ -237,14 +334,8 @@ map_hugetlb.c.
 
 #define dprintf(x)  printf(x)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define SHMAT_FLAGS (0)
-#endif
 
 int main(void)
 {
@@ -302,10 +393,12 @@ int main(void)
  * example, the app is requesting memory of size 256MB that is backed by
  * huge pages.
  *
- * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
- * That means the addresses starting with 0x800000... will need to be
- * specified.  Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
  */
 #include <stdlib.h>
 #include <stdio.h>
@@ -317,14 +410,8 @@ int main(void)
 #define LENGTH (256UL*1024*1024)
 #define PROTECTION (PROT_READ | PROT_WRITE)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define FLAGS (MAP_SHARED)
-#endif
 
 void check_bytes(char *addr)
 {


^ permalink raw reply	[flat|nested] 81+ messages in thread


* Re: [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-01 14:47     ` Mel Gorman
  -1 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-01 14:47 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, David Rientjes, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Fri, Aug 28, 2009 at 12:03:32PM -0400, Lee Schermerhorn wrote:
> [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
> 
> Against: 2.6.31-rc7-mmotm-090827-0057
> 
> V2:
> + cleaned up comments, removed some deemed unnecessary,
>   add some suggested by review
> + removed check for !current in huge_mpol_nodes_allowed().
> + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
> + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
>   catch out of range node id.
> + add examples to patch description
> 
> V3: Factored this patch from V2 patch 2/3
> 
> V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()
> 
> V5: remove internal '\n' from printk in huge_mpol_nodes_allowed()
> 
> This patch derives a "nodes_allowed" node mask from the numa
> mempolicy of the task modifying the number of persistent huge
> pages to control the allocation, freeing and adjusting of surplus
> huge pages.  This mask is derived as follows:
> 
> * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
>   is produced.  This will cause the hugetlb subsystem to use
>   node_online_map as the "nodes_allowed".  This preserves the
>   behavior before this patch.
> * For "preferred" mempolicy, including explicit local allocation,
>   a nodemask with the single preferred node will be produced. 
>   "local" policy will NOT track any internode migrations of the
>   task adjusting nr_hugepages.
> * For "bind" and "interleave" policy, the mempolicy's nodemask
>   will be used.
> * Other than to inform the construction of the nodes_allowed node
>   mask, the actual mempolicy mode is ignored.  That is, all modes
>   behave like interleave over the resulting nodes_allowed mask
>   with no "fallback".
> 
> Notes:
> 
> 1) This patch introduces a subtle change in behavior:  huge page
>    allocation and freeing will be constrained by any mempolicy
>    that the task adjusting the huge page pool inherits from its
>    parent.  This policy could come from a distant ancestor.  The
>    administrator adjusting the huge page pool without explicitly
>    specifying a mempolicy via numactl might be surprised by this.
>    Additionally, any mempolicy specified by numactl will be
>    constrained by the cpuset in which numactl is invoked.
>    Using sysfs per node hugepages attributes to adjust the per
>    node persistent huge pages count [subsequent patch] ignores
>    mempolicy and cpuset constraints.
> 
> 2) Hugepages allocated at boot time use the node_online_map.
>    An additional patch could implement a temporary boot time
>    huge pages nodes_allowed command line parameter.
> 
> 3) Using mempolicy to control persistent huge page allocation
>    and freeing requires no change to hugeadm when invoking
>    it via numactl, as shown in the examples below.  However,
>    hugeadm could be enhanced to take the allowed nodes as an
>    argument and set its task mempolicy itself.  This would allow
>    it to detect and warn about any non-default mempolicy that it
>    inherited from its parent, thus alleviating the issue described
>    in Note 1 above.
> 
> See the updated documentation [next patch] for more information
> about the implications of this patch.
> 
> Examples:
> 
> Starting with:
> 
> 	Node 0 HugePages_Total:     0
> 	Node 1 HugePages_Total:     0
> 	Node 2 HugePages_Total:     0
> 	Node 3 HugePages_Total:     0
> 
> Default behavior [with or without this patch] balances persistent
> hugepage allocation across nodes [with sufficient contiguous memory]:
> 
> 	hugeadm --pool-pages-min=2048Kb:32
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:     8
> 	Node 3 HugePages_Total:     8
> 
> Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> '--membind' because it allows multiple nodes to be specified
> and it's easy to type]--we can allocate huge pages on
> individual nodes or sets of nodes.  So, starting from the 
> condition above, with 8 huge pages per node:
> 
> 	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The incremental 8 huge pages were restricted to node 2 by the
> specified mempolicy.
> 
> Similarly, we can use mempolicy to free persistent huge pages
> from specified nodes:
> 
> 	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8
> 
> yields:
> 
> 	Node 0 HugePages_Total:     4
> 	Node 1 HugePages_Total:     4
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The 8 huge pages freed were balanced over nodes 0 and 1.
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 

This seems to behave as advertised and I could spot no other problems in
the code, so make that a:

Reviewed-by: Mel Gorman <mel@csn.ul.ie>

As an aside, I think the signed-offs, acks and reviews are in the wrong
order. The lines should appear in the order people were involved. You are
the first as the author and I'm the second, so it should be

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>

rather than the other way around.

>  include/linux/mempolicy.h |    3 ++
>  mm/hugetlb.c              |   14 ++++++----
>  mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 73 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/mempolicy.c	2009-08-28 09:21:20.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c	2009-08-28 09:21:28.000000000 -0400
> @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
>  	}
>  	return zl;
>  }
> +
> +/*
> + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> + *
> + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> + * to constrain the allocation and freeing of persistent huge pages.
> + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> + * page will never "fallback" to another node inside the buddy system
> + * allocator.
> + *
> + * If the task's mempolicy is "default" [NULL], just return NULL for
> + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> + * or 'interleave' policy or construct a nodemask for 'preferred' or
> + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> + *
> + * N.B., it is the caller's responsibility to free a returned nodemask.
> + */
> +nodemask_t *huge_mpol_nodes_allowed(void)
> +{
> +	nodemask_t *nodes_allowed = NULL;
> +	struct mempolicy *mempolicy;
> +	int nid;
> +
> +	if (!current->mempolicy)
> +		return NULL;
> +
> +	mpol_get(current->mempolicy);
> +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> +	if (!nodes_allowed) {
> +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> +			"for huge page allocation.  Falling back to default.\n",
> +			current->comm);
> +		goto out;
> +	}
> +	nodes_clear(*nodes_allowed);
> +
> +	mempolicy = current->mempolicy;
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		if (mempolicy->flags & MPOL_F_LOCAL)
> +			nid = numa_node_id();
> +		else
> +			nid = mempolicy->v.preferred_node;
> +		node_set(nid, *nodes_allowed);
> +		break;
> +
> +	case MPOL_BIND:
> +		/* Fall through */
> +	case MPOL_INTERLEAVE:
> +		*nodes_allowed =  mempolicy->v.nodes;
> +		break;
> +
> +	default:
> +		BUG();
> +	}
> +
> +out:
> +	mpol_put(current->mempolicy);
> +	return nodes_allowed;
> +}
>  #endif
>  
>  /* Allocate a page in interleaved policy.
> Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/mempolicy.h	2009-08-28 09:21:20.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/mempolicy.h	2009-08-28 09:21:28.000000000 -0400
> @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> +extern nodemask_t *huge_mpol_nodes_allowed(void);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> +
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
>  			const nodemask_t *to_nodes, int flags)
> Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:26.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
> @@ -1248,10 +1248,13 @@ static int adjust_pool_surplus(struct hs
>  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
>  {
>  	unsigned long min_count, ret;
> +	nodemask_t *nodes_allowed;
>  
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> +	nodes_allowed = huge_mpol_nodes_allowed();
> +
>  	/*
>  	 * Increase the pool size
>  	 * First take pages out of surplus state.  Then make up the
> @@ -1265,7 +1268,7 @@ static unsigned long set_max_huge_pages(
>  	 */
>  	spin_lock(&hugetlb_lock);
>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> -		if (!adjust_pool_surplus(h, NULL, -1))
> +		if (!adjust_pool_surplus(h, nodes_allowed, -1))
>  			break;
>  	}
>  
> @@ -1276,7 +1279,7 @@ static unsigned long set_max_huge_pages(
>  		 * and reducing the surplus.
>  		 */
>  		spin_unlock(&hugetlb_lock);
> -		ret = alloc_fresh_huge_page(h, NULL);
> +		ret = alloc_fresh_huge_page(h, nodes_allowed);
>  		spin_lock(&hugetlb_lock);
>  		if (!ret)
>  			goto out;
> @@ -1300,18 +1303,19 @@ static unsigned long set_max_huge_pages(
>  	 */
>  	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>  	min_count = max(count, min_count);
> -	try_to_free_low(h, min_count, NULL);
> +	try_to_free_low(h, min_count, nodes_allowed);
>  	while (min_count < persistent_huge_pages(h)) {
> -		if (!free_pool_huge_page(h, NULL, 0))
> +		if (!free_pool_huge_page(h, nodes_allowed, 0))
>  			break;
>  	}
>  	while (count < persistent_huge_pages(h)) {
> -		if (!adjust_pool_surplus(h, NULL, 1))
> +		if (!adjust_pool_surplus(h, nodes_allowed, 1))
>  			break;
>  	}
>  out:
>  	ret = persistent_huge_pages(h);
>  	spin_unlock(&hugetlb_lock);
> +	kfree(nodes_allowed);
>  	return ret;
>  }
>  
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread


* Re: [PATCH 4/6] hugetlb:  introduce alloc_nodemask_of_node
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-01 14:49     ` Mel Gorman
  -1 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-01 14:49 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, David Rientjes, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Fri, Aug 28, 2009 at 12:03:38PM -0400, Lee Schermerhorn wrote:
> [PATCH 4/6] - hugetlb:  introduce alloc_nodemask_of_node()
> 
> Against:  2.6.31-rc7-mmotm-090827-0057
> 
> New in V5 of series
> 
> Introduce nodemask macro to allocate a nodemask and 
> initialize it to contain a single node, using the macro
> init_nodemask_of_node() factored out of the nodemask_of_node()
> macro.
> 
> alloc_nodemask_of_node() coded as a macro to avoid header
> dependency hell.
> 
> This will be used to construct the huge pages "nodes_allowed"
> nodemask for a single node when a persistent huge page
> pool page count is modified via a per node sysfs attribute.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/nodemask.h |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/nodemask.h	2009-08-28 09:21:19.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h	2009-08-28 09:21:29.000000000 -0400
> @@ -245,18 +245,34 @@ static inline int __next_node(int n, con
>  	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
>  }
>  
> +#define init_nodemask_of_nodes(mask, node)				\
> +	nodes_clear(*(mask));						\
> +	node_set((node), *(mask));
> +

Is it the done thing to either make this a static inline or else wrap it
in a do { } while(0)?  The reasoning being that if this is used as part
of another statement (e.g. a for loop), it'll actually compile instead of
throwing up weird error messages.
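
To make the pitfall concrete (INIT_PAIR is a stand-in for a two-statement
macro like init_nodemask_of_nodes() above, not code from this patch):

	/* Two statements, no braces: the expansion is NOT one statement */
	#define INIT_PAIR(a, b)		\
		(a) = 0;		\
		(b) = 0;

	/* Wrapped in do { } while (0), the expansion is one statement */
	#define INIT_PAIR_SAFE(a, b)	\
		do {			\
			(a) = 0;	\
			(b) = 0;	\
		} while (0)

	void example(int flag, int x, int y)
	{
		if (flag)
			INIT_PAIR(x, y);	/* only "(a) = 0;" is guarded;
						 * "(b) = 0;" always runs, and
						 * adding an "else" here would
						 * not compile at all */

		if (flag)
			INIT_PAIR_SAFE(x, y);	/* behaves as one statement */
		else
			x = 1;
	}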

>  #define nodemask_of_node(node)						\
>  ({									\
>  	typeof(_unused_nodemask_arg_) m;				\
>  	if (sizeof(m) == sizeof(unsigned long)) {			\
>  		m.bits[0] = 1UL<<(node);				\
>  	} else {							\
> -		nodes_clear(m);						\
> -		node_set((node), m);					\
> +		init_nodemask_of_nodes(&m, (node));			\
>  	}								\
>  	m;								\
>  })
>  
> +/*
> + * returns pointer to kmalloc()'d nodemask initialized to contain the
> + * specified node.  Caller must free with kfree().
> + */
> +#define alloc_nodemask_of_node(node)					\
> +({									\
> +	typeof(_unused_nodemask_arg_) *nmp;				\
> +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> +	if (nmp)							\
> +		init_nodemask_of_nodes(nmp, (node));			\
> +	nmp;								\
> +})
> +

Otherwise, it looks ok.

>  #define first_unset_node(mask) __first_unset_node(&(mask))
>  static inline int __first_unset_node(const nodemask_t *maskp)
>  {
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread


* Re: [PATCH 5/6] hugetlb:  add per node hstate attributes
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-01 15:20     ` Mel Gorman
  -1 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-01 15:20 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, David Rientjes, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Fri, Aug 28, 2009 at 12:03:44PM -0400, Lee Schermerhorn wrote:
> [PATCH 5/6] hugetlb:  register per node hugepages attributes
> 
> Against: 2.6.31-rc7-mmotm-090820-1918
> 
> V2:  remove dependency on kobject private bitfield.  Search
>      global hstates then all per node hstates for kobject
>      match in attribute show/store functions.
> 
> V3:  rebase atop the mempolicy-based hugepage alloc/free;
>      use custom "nodes_allowed" to restrict alloc/free to
>      a specific node via per node attributes.  Per node
>      attribute overrides mempolicy.  I.e., mempolicy only
>      applies to global attributes.
> 
> V5:  Fix issues raised by Mel Gorman:
>      + add !NUMA versions of hugetlb_[un]register_node()
>      + rename 'hi' to 'i' in kobj_to_node_hstate()
>      + rename (count, input) to (len, count) in nr_hugepages_store()
>      + moved per node hugepages_kobj and hstate_kobjs[] from the
>        struct node [sysdev] to hugetlb.c private arrays.
>      + changed registration mechanism so that hugetlbfs [a module]
>        registers its attribute registration callbacks with the node
>        driver, eliminating the dependency between the node driver
>        and hugetlbfs.  From its init func, hugetlbfs will register
>        all on-line nodes' hugepage sysfs attributes along with
>        hugetlbfs' attribute register/unregister functions.  The
>        node driver will use these functions to [un]register nodes
>        with hugetlbfs on node hot-plug.
>      + replaced hugetlb.c private "nodes_allowed_from_node()" with
>        [new] generic "alloc_nodemask_of_node()".
> 
> V5a: + fix !NUMA register_hugetlbfs_with_node():  don't use
>        keyword 'do' as parameter name!
> 
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
> 
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> 	nr_hugepages       - r/w
> 	free_hugepages     - r/o
> 	surplus_hugepages  - r/o
> 
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling, and the
> "nodes_allowed" constraint processing as possible.
> Calling set_max_huge_pages() with no node indicates a change to
> global hstate parameters.  In this case, any non-default task
> mempolicy will be used to generate the nodes_allowed mask.  A
> valid node id indicates an update to that node's hstate 
> parameters, and the count argument specifies the target count
> for the specified node.  From this info, we compute the target
> global count for the hstate and construct a nodes_allowed node
> mask containing only the specified node.
> 
> Setting the node specific nr_hugepages via the per node attribute
> effectively ignores any task mempolicy or cpuset constraints.
> 
> With this patch:
> 
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
> 
> Starting from:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     0
> Node 2 HugePages_Free:      0
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 0
> 
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
> 
> [Note that this is equivalent to:
> 	numactl -m 2 hugeadm --pool-pages-min 2M:+16
> ]
> 
> Yields:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:    16
> Node 2 HugePages_Free:     16
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 16
> 
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     8
> Node 2 HugePages_Free:      8
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 

Ok, this appears to be acting as advertised as well. It also builds and
boots on !NUMA, which is something it tripped over before. The dependencies
between base NUMA support and hugetlbfs still look reasonable to my eye,
and you've cleaned up the magic numbers I was whinging about last time.

Basically I have no problem with it, even though I still think that
memory policies were sufficient. I'm going to chicken out and leave it
up to David Rientjes, but from me:

Acked-by: Mel Gorman <mel@csn.ul.ie>

>  drivers/base/node.c  |   33 ++++++
>  include/linux/node.h |    8 +
>  include/linux/numa.h |    2 
>  mm/hugetlb.c         |  245 ++++++++++++++++++++++++++++++++++++++++++++-------
>  4 files changed, 258 insertions(+), 30 deletions(-)
> 
> Index: linux-2.6.31-rc7-mmotm-090827-0057/drivers/base/node.c
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/drivers/base/node.c	2009-08-28 09:21:17.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/drivers/base/node.c	2009-08-28 09:21:31.000000000 -0400
> @@ -177,6 +177,37 @@ static ssize_t node_read_distance(struct
>  }
>  static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
>  
> +/*
> + * hugetlbfs per node attributes registration interface:
> + * When/if hugetlb[fs] subsystem initializes [sometime after this module],
> + * it will register its per node attributes for all nodes on-line at that
> + * point.  It will also call register_hugetlbfs_with_node(), below, to
> + * register its attribute registration functions with this node driver.
> + * Once these hooks have been initialized, the node driver will call into
> + * the hugetlb module to [un]register attributes for hot-plugged nodes.
> + */
> +NODE_REGISTRATION_FUNC __hugetlb_register_node;
> +NODE_REGISTRATION_FUNC __hugetlb_unregister_node;
> +
> +static inline void hugetlb_register_node(struct node *node)
> +{
> +	if (__hugetlb_register_node)
> +		__hugetlb_register_node(node);
> +}
> +
> +static inline void hugetlb_unregister_node(struct node *node)
> +{
> +	if (__hugetlb_unregister_node)
> +		__hugetlb_unregister_node(node);
> +}
> +
> +void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
> +                                  NODE_REGISTRATION_FUNC unregister)
> +{
> +	__hugetlb_register_node   = doregister;
> +	__hugetlb_unregister_node = unregister;
> +}
> +
>  
>  /*
>   * register_node - Setup a sysfs device for a node.
> @@ -200,6 +231,7 @@ int register_node(struct node *node, int
>  		sysdev_create_file(&node->sysdev, &attr_distance);
>  
>  		scan_unevictable_register_node(node);
> +		hugetlb_register_node(node);
>  	}
>  	return error;
>  }
> @@ -220,6 +252,7 @@ void unregister_node(struct node *node)
>  	sysdev_remove_file(&node->sysdev, &attr_distance);
>  
>  	scan_unevictable_unregister_node(node);
> +	hugetlb_unregister_node(node);
>  
>  	sysdev_unregister(&node->sysdev);
>  }
> Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:31.000000000 -0400
> @@ -24,6 +24,7 @@
>  #include <asm/io.h>
>  
>  #include <linux/hugetlb.h>
> +#include <linux/node.h>
>  #include "internal.h"
>  
>  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs
>  }
>  
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> +								int nid)
>  {
>  	unsigned long min_count, ret;
>  	nodemask_t *nodes_allowed;
> @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> -	nodes_allowed = huge_mpol_nodes_allowed();
> +	if (nid == NO_NODEID_SPECIFIED)
> +		nodes_allowed = huge_mpol_nodes_allowed();
> +	else {
> +		/*
> +		 * incoming 'count' is for node 'nid' only, so
> +		 * adjust count to global, but restrict alloc/free
> +		 * to the specified node.
> +		 */
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		nodes_allowed = alloc_nodemask_of_node(nid);
> +		if (!nodes_allowed)
> +			printk(KERN_WARNING "%s unable to allocate allowed "
> +			       "nodes mask for huge page allocation/free.  "
> +			       "Falling back to default.\n", current->comm);
> +	}
>  
>  	/*
>  	 * Increase the pool size
> @@ -1329,51 +1345,71 @@ out:
>  static struct kobject *hugepages_kobj;
>  static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
>  
> -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
> +
> +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
>  {
>  	int i;
> +
>  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> -		if (hstate_kobjs[i] == kobj)
> +		if (hstate_kobjs[i] == kobj) {
> +			if (nidp)
> +				*nidp = NO_NODEID_SPECIFIED;
>  			return &hstates[i];
> -	BUG();
> -	return NULL;
> +		}
> +
> +	return kobj_to_node_hstate(kobj, nidp);
>  }
>  
>  static ssize_t nr_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> +	struct hstate *h;
> +	unsigned long nr_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		nr_huge_pages = h->nr_huge_pages;
> +	else
> +		nr_huge_pages = h->nr_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", nr_huge_pages);
>  }
> +
>  static ssize_t nr_hugepages_store(struct kobject *kobj,
> -		struct kobj_attribute *attr, const char *buf, size_t count)
> +		struct kobj_attribute *attr, const char *buf, size_t len)
>  {
> +	unsigned long count;
> +	struct hstate *h;
> +	int nid;
>  	int err;
> -	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
>  
> -	err = strict_strtoul(buf, 10, &input);
> +	err = strict_strtoul(buf, 10, &count);
>  	if (err)
>  		return 0;
>  
> -	h->max_huge_pages = set_max_huge_pages(h, input);
> +	h = kobj_to_hstate(kobj, &nid);
> +	h->max_huge_pages = set_max_huge_pages(h, count, nid);
>  
> -	return count;
> +	return len;
>  }
>  HSTATE_ATTR(nr_hugepages);
>  
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
>  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
>  }
> +
>  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t count)
>  {
>  	int err;
>  	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  
>  	err = strict_strtoul(buf, 10, &input);
>  	if (err)
> @@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
>  static ssize_t free_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> +	struct hstate *h;
> +	unsigned long free_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		free_huge_pages = h->free_huge_pages;
> +	else
> +		free_huge_pages = h->free_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", free_huge_pages);
>  }
>  HSTATE_ATTR_RO(free_hugepages);
>  
>  static ssize_t resv_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
>  }
>  HSTATE_ATTR_RO(resv_hugepages);
> @@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages);
>  static ssize_t surplus_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> +	struct hstate *h;
> +	unsigned long surplus_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		surplus_huge_pages = h->surplus_huge_pages;
> +	else
> +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", surplus_huge_pages);
>  }
>  HSTATE_ATTR_RO(surplus_hugepages);
>  
> @@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att
>  	.attrs = hstate_attrs,
>  };
>  
> -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> +				struct kobject *parent,
> +				struct kobject **hstate_kobjs,
> +				struct attribute_group *hstate_attr_group)
>  {
>  	int retval;
> +	int hi = h - hstates;
>  
> -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> -							hugepages_kobj);
> -	if (!hstate_kobjs[h - hstates])
> +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> +	if (!hstate_kobjs[hi])
>  		return -ENOMEM;
>  
> -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> -							&hstate_attr_group);
> +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
>  	if (retval)
> -		kobject_put(hstate_kobjs[h - hstates]);
> +		kobject_put(hstate_kobjs[hi]);
>  
>  	return retval;
>  }
> @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
>  		return;
>  
>  	for_each_hstate(h) {
> -		err = hugetlb_sysfs_add_hstate(h);
> +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> +					 hstate_kobjs, &hstate_attr_group);
>  		if (err)
>  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
>  								h->name);
>  	}
>  }
>  
> +#ifdef CONFIG_NUMA
> +
> +struct node_hstate {
> +	struct kobject		*hugepages_kobj;
> +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> +};
> +struct node_hstate node_hstates[MAX_NUMNODES];
> +
> +static struct attribute *per_node_hstate_attrs[] = {
> +	&nr_hugepages_attr.attr,
> +	&free_hugepages_attr.attr,
> +	&surplus_hugepages_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group per_node_hstate_attr_group = {
> +	.attrs = per_node_hstate_attrs,
> +};
> +
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node_hstate *nhs = &node_hstates[nid];
> +		int i;
> +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> +			if (nhs->hstate_kobjs[i] == kobj) {
> +				if (nidp)
> +					*nidp = nid;
> +				return &hstates[i];
> +			}
> +	}
> +
> +	BUG();
> +	return NULL;
> +}
> +
> +void hugetlb_unregister_node(struct node *node)
> +{
> +	struct hstate *h;
> +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> +
> +	if (!nhs->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h)
> +		if (nhs->hstate_kobjs[h - hstates]) {
> +			kobject_put(nhs->hstate_kobjs[h - hstates]);
> +			nhs->hstate_kobjs[h - hstates] = NULL;
> +		}
> +
> +	kobject_put(nhs->hugepages_kobj);
> +	nhs->hugepages_kobj = NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		hugetlb_unregister_node(&node_devices[nid]);
> +
> +	register_hugetlbfs_with_node(NULL, NULL);
> +}
> +
> +void hugetlb_register_node(struct node *node)
> +{
> +	struct hstate *h;
> +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> +	int err;
> +
> +	if (nhs->hugepages_kobj)
> +		return;		/* already allocated */
> +
> +	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
> +							&node->sysdev.kobj);
> +	if (!nhs->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h) {
> +		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
> +						nhs->hstate_kobjs,
> +						&per_node_hstate_attr_group);
> +		if (err) {
> +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> +					" for node %d\n",
> +						h->name, node->sysdev.id);
> +			hugetlb_unregister_node(node);
> +			break;
> +		}
> +	}
> +}
> +
> +static void hugetlb_register_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node *node = &node_devices[nid];
> +		if (node->sysdev.id == nid)
> +			hugetlb_register_node(node);
> +	}
> +
> +	register_hugetlbfs_with_node(hugetlb_register_node,
> +                                     hugetlb_unregister_node);
> +}
> +#else	/* !CONFIG_NUMA */
> +
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	BUG();
> +	if (nidp)
> +		*nidp = -1;
> +	return NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void) { }
> +
> +static void hugetlb_register_all_nodes(void) { }
> +
> +#endif
> +
>  static void __exit hugetlb_exit(void)
>  {
>  	struct hstate *h;
>  
> +	hugetlb_unregister_all_nodes();
> +
>  	for_each_hstate(h) {
>  		kobject_put(hstate_kobjs[h - hstates]);
>  	}
> @@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void)
>  
>  	hugetlb_sysfs_init();
>  
> +	hugetlb_register_all_nodes();
> +
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	proc_doulongvec_minmax(table, write, buffer, length, ppos);
>  
>  	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> +		h->max_huge_pages = set_max_huge_pages(h, tmp,
> +		                                       NO_NODEID_SPECIFIED);
>  
>  	return 0;
>  }
> Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/numa.h	2009-08-28 09:21:17.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h	2009-08-28 09:21:31.000000000 -0400
> @@ -10,4 +10,6 @@
>  
>  #define MAX_NUMNODES    (1 << NODES_SHIFT)
>  
> +#define NO_NODEID_SPECIFIED	(-1)
> +
>  #endif /* _LINUX_NUMA_H */
> Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/node.h
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/node.h	2009-08-28 09:21:17.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/node.h	2009-08-28 09:21:31.000000000 -0400
> @@ -28,6 +28,7 @@ struct node {
>  
>  struct memory_block;
>  extern struct node node_devices[];
> +typedef  void (*NODE_REGISTRATION_FUNC)(struct node *);
>  
>  extern int register_node(struct node *, int, struct node *);
>  extern void unregister_node(struct node *node);
> @@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns
>  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  						int nid);
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
> +extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
> +                                         NODE_REGISTRATION_FUNC unregister);
>  #else
>  static inline int register_one_node(int nid)
>  {
> @@ -65,6 +68,11 @@ static inline int unregister_mem_sect_un
>  {
>  	return 0;
>  }
> +
> +static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC reg,
> +                                                NODE_REGISTRATION_FUNC unreg)
> +{
> +}
>  #endif
>  
>  #define to_node(sys_device) container_of(sys_device, struct node, sysdev)
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] hugetlb:  introduce alloc_nodemask_of_node
  2009-09-01 14:49     ` Mel Gorman
@ 2009-09-01 16:42       ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-01 16:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, David Rientjes, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-09-01 at 15:49 +0100, Mel Gorman wrote:
> On Fri, Aug 28, 2009 at 12:03:38PM -0400, Lee Schermerhorn wrote:
> > [PATCH 4/6] - hugetlb:  introduce alloc_nodemask_of_node()
> > 
> > Against:  2.6.31-rc7-mmotm-090827-0057
> > 
> > New in V5 of series
> > 
> > Introduce nodemask macro to allocate a nodemask and 
> > initialize it to contain a single node, using the macro
> > init_nodemask_of_node() factored out of the nodemask_of_node()
> > macro.
> > 
> > alloc_nodemask_of_node() coded as a macro to avoid header
> > dependency hell.
> > 
> > This will be used to construct the huge pages "nodes_allowed"
> > nodemask for a single node when a persistent huge page
> > pool page count is modified via a per node sysfs attribute.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/nodemask.h |   20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h
> > ===================================================================
> > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/nodemask.h	2009-08-28 09:21:19.000000000 -0400
> > +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h	2009-08-28 09:21:29.000000000 -0400
> > @@ -245,18 +245,34 @@ static inline int __next_node(int n, con
> >  	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
> >  }
> >  
> > +#define init_nodemask_of_nodes(mask, node)				\
> > +	nodes_clear(*(mask));						\
> > +	node_set((node), *(mask));
> > +
> 
> Is the done thing to either make this a static inline or else wrap it in
> a do { } while(0)?  The reasoning being that if this is used as part of
> another statement (e.g. a for loop), it'll actually compile instead of
> throwing up weird error messages.

Right.  I'll fix this [and signoff/review orders] next time [maybe last
time?].  It occurs to me that I can also use this for
huge_mpol_nodes_allowed(), so I'll move it up in the series and fix that
[which you've already ack'd].  I'll wait a bit to hear from David before
I respin.
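
For reference, the do { } while (0) version would be something like
this -- just a sketch, not the respin itself:

	#define init_nodemask_of_nodes(mask, node)		\
	do {							\
		nodes_clear(*(mask));				\
		node_set((node), *(mask));			\
	} while (0)

so that, e.g.,

	if (nmp)
		init_nodemask_of_nodes(nmp, nid);

compiles as expected without needing braces around the call.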

Thanks,
Lee
> 
> >  #define nodemask_of_node(node)						\
> >  ({									\
> >  	typeof(_unused_nodemask_arg_) m;				\
> >  	if (sizeof(m) == sizeof(unsigned long)) {			\
> >  		m.bits[0] = 1UL<<(node);				\
> >  	} else {							\
> > -		nodes_clear(m);						\
> > -		node_set((node), m);					\
> > +		init_nodemask_of_nodes(&m, (node));			\
> >  	}								\
> >  	m;								\
> >  })
> >  
> > +/*
> > + * returns pointer to kmalloc()'d nodemask initialized to contain the
> > + * specified node.  Caller must free with kfree().
> > + */
> > +#define alloc_nodemask_of_node(node)					\
> > +({									\
> > +	typeof(_unused_nodemask_arg_) *nmp;				\
> > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> > +	if (nmp)							\
> > +		init_nodemask_of_nodes(nmp, (node));			\
> > +	nmp;								\
> > +})
> > +
> 
> Otherwise, it looks ok.
> 
> >  #define first_unset_node(mask) __first_unset_node(&(mask))
> >  static inline int __first_unset_node(const nodemask_t *maskp)
> >  {
> > 
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] hugetlb:  introduce alloc_nodemask_of_node
  2009-09-01 16:42       ` Lee Schermerhorn
@ 2009-09-03 18:34         ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 18:34 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, akpm, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 1 Sep 2009, Lee Schermerhorn wrote:

> > > Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h
> > > ===================================================================
> > > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/nodemask.h	2009-08-28 09:21:19.000000000 -0400
> > > +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h	2009-08-28 09:21:29.000000000 -0400
> > > @@ -245,18 +245,34 @@ static inline int __next_node(int n, con
> > >  	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
> > >  }
> > >  
> > > +#define init_nodemask_of_nodes(mask, node)				\
> > > +	nodes_clear(*(mask));						\
> > > +	node_set((node), *(mask));
> > > +
> > 
> > Is the done thing to either make this a static inline or else wrap it in
> > a do { } while(0)?  The reasoning being that if this is used as part of
> > another statement (e.g. a for loop), it'll actually compile instead of
> > throwing up weird error messages.
> 
> Right.  I'll fix this [and signoff/review orders] next time [maybe last
> time?].  It occurs to me that I can also use this for
> huge_mpol_nodes_allowed(), so I'll move it up in the series and fix that
> [which you've already ack'd].  I'll wait a bit to hear from David before
> I respin.
> 

I think it should be an inline function just so there's typechecking on 
the first argument passed in (and so alloc_nodemask_of_node() below 
doesn't get a NULL pointer dereference on node_set() if nmp can't be 
allocated).
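
Something along these lines, say -- a sketch, assuming the current name
and arguments are kept:

	static inline void init_nodemask_of_nodes(nodemask_t *mask, int node)
	{
		nodes_clear(*mask);
		node_set(node, *mask);
	}

which gives the compiler a chance to complain when 'mask' isn't actually
a nodemask_t *.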

I've seen the issue about the signed-off-by/reviewed-by/acked-by order 
come up before.  I've always put my signed-off-by line last whenever 
proposing patches because it shows a clear order in who gathered those 
lines when submitting to -mm, for example.  If I write

	Cc: Mel Gorman <mel@csn.ul.ie>
	Signed-off-by: David Rientjes <rientjes@google.com>

it is clear that I cc'd Mel on the initial proposal.  If it is the other 
way around, for example,

	Signed-off-by: David Rientjes <rientjes@google.com>
	Cc: Mel Gorman <mel@csn.ul.ie>
	Signed-off-by: Andrew Morton...

then it indicates Andrew added the cc when merging into -mm.  That's more
relevant when such a line is an acked-by or reviewed-by, since it then
becomes possible to determine who received that acknowledgement from the
individual and who is responsible for correctly relaying it in the patch
submission.

If it's done this way, it indicates that whoever signs off the patch is
responsible for everything above it.  The type of line (signed-off-by,
reviewed-by, acked-by) is, I believe, enough of an indication of the
patch's development history and doesn't require a specific ordering to
communicate that (nor is it really important that the first line be a
signed-off-by, since it doesn't replace the From: line).

It also appears to be how Linus merges his own patches with Cc's.

> > > +/*
> > > + * returns pointer to kmalloc()'d nodemask initialized to contain the
> > > + * specified node.  Caller must free with kfree().
> > > + */
> > > +#define alloc_nodemask_of_node(node)					\
> > > +({									\
> > > +	typeof(_unused_nodemask_arg_) *nmp;				\
> > > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> > > +	if (nmp)							\
> > > +		init_nodemask_of_nodes(nmp, (node));			\
> > > +	nmp;								\
> > > +})
> > > +
> > 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/6] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-03 18:39     ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 18:39 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Fri, 28 Aug 2009, Lee Schermerhorn wrote:

> [PATCH 2/6] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
> 
> Against:  2.6.31-rc7-mmotm-090827-0057
> 
> V3:
> + moved this patch to after the "rework" of hstate_next_node_to_...
>   functions as this patch is more specific to using task mempolicy
>   to control huge page allocation and freeing.
> 
> V5:
> + removed now unneeded 'nextnid' from hstate_next_node_to_{alloc|free}
>   and updated the stale comments.
> 
> In preparation for constraining huge page allocation and freeing by the
> controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
> to the allocate, free and surplus adjustment functions.  For now, pass
> NULL to indicate default behavior--i.e., use node_online_map.  A
> subsequent patch will derive a non-default mask from the controlling 
> task's numa mempolicy.
> 
> Note that this method of updating the global hstate nr_hugepages under
> the constraint of a nodemask simplifies keeping the global state 
> consistent--especially the number of persistent and surplus pages
> relative to reservations and overcommit limits.  There are undoubtedly
> other ways to do this, but this works for both interfaces:  mempolicy
> and per node attributes.
> 
> Reviewed-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Still think the name `this_node_allowed()' is awkward, but I'm glad to see 
hstate_next_node_to_{alloc,free} is clean.

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-03 19:22     ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 19:22 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Fri, 28 Aug 2009, Lee Schermerhorn wrote:

> Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/mempolicy.c	2009-08-28 09:21:20.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c	2009-08-28 09:21:28.000000000 -0400
> @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
>  	}
>  	return zl;
>  }
> +
> +/*
> + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> + *
> + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> + * to constraing the allocation and freeing of persistent huge pages
> + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> + * page will never "fallback" to another node inside the buddy system
> + * allocator.
> + *
> + * If the task's mempolicy is "default" [NULL], just return NULL for
> + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> + * or 'interleave' policy or construct a nodemask for 'preferred' or
> + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> + *
> + * N.B., it is the caller's responsibility to free a returned nodemask.
> + */

This isn't limited to hugepage code, so a more appropriate name would 
probably be better.

It'd probably be better to check for a NULL nodes_allowed in 
set_max_huge_pages() rather than in hstate_next_node_to_{alloc,free}, just 
for the cleanliness of the code, OR simply return node_online_map from this 
function for default policies.
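
The latter might look something like this sketch (function and variable 
names taken from the patch; not the merged code):

	nodemask_t *huge_mpol_nodes_allowed(void)
	{
		nodemask_t *nodes_allowed;

		if (!current->mempolicy)	/* default policy */
			return &node_online_map;

		nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
		/* ... fill from the bind/interleave/preferred nodes ... */
		return nodes_allowed;
	}

though callers would then have to know not to kfree() the static map.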

Otherwise

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 5/6] hugetlb:  add per node hstate attributes
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-03 19:52     ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 19:52 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

On Fri, 28 Aug 2009, Lee Schermerhorn wrote:

> Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:31.000000000 -0400
> @@ -24,6 +24,7 @@
>  #include <asm/io.h>
>  
>  #include <linux/hugetlb.h>
> +#include <linux/node.h>
>  #include "internal.h"
>  
>  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs
>  }
>  
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> +								int nid)
>  {
>  	unsigned long min_count, ret;
>  	nodemask_t *nodes_allowed;
> @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> -	nodes_allowed = huge_mpol_nodes_allowed();
> +	if (nid == NO_NODEID_SPECIFIED)
> +		nodes_allowed = huge_mpol_nodes_allowed();
> +	else {
> +		/*
> +		 * incoming 'count' is for node 'nid' only, so
> +		 * adjust count to global, but restrict alloc/free
> +		 * to the specified node.
> +		 */
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		nodes_allowed = alloc_nodemask_of_node(nid);
> +		if (!nodes_allowed)
> +			printk(KERN_WARNING "%s unable to allocate allowed "
> +			       "nodes mask for huge page allocation/free.  "
> +			       "Falling back to default.\n", current->comm);
> +	}
>  
>  	/*
>  	 * Increase the pool size
> @@ -1329,51 +1345,71 @@ out:
>  static struct kobject *hugepages_kobj;
>  static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
>  
> -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
> +
> +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
>  {
>  	int i;
> +
>  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> -		if (hstate_kobjs[i] == kobj)
> +		if (hstate_kobjs[i] == kobj) {
> +			if (nidp)
> +				*nidp = NO_NODEID_SPECIFIED;
>  			return &hstates[i];
> -	BUG();
> -	return NULL;
> +		}
> +
> +	return kobj_to_node_hstate(kobj, nidp);
>  }
>  
>  static ssize_t nr_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> +	struct hstate *h;
> +	unsigned long nr_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		nr_huge_pages = h->nr_huge_pages;
> +	else
> +		nr_huge_pages = h->nr_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", nr_huge_pages);
>  }
> +
>  static ssize_t nr_hugepages_store(struct kobject *kobj,
> -		struct kobj_attribute *attr, const char *buf, size_t count)
> +		struct kobj_attribute *attr, const char *buf, size_t len)
>  {
> +	unsigned long count;
> +	struct hstate *h;
> +	int nid;
>  	int err;
> -	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
>  
> -	err = strict_strtoul(buf, 10, &input);
> +	err = strict_strtoul(buf, 10, &count);
>  	if (err)
>  		return 0;
>  
> -	h->max_huge_pages = set_max_huge_pages(h, input);
> +	h = kobj_to_hstate(kobj, &nid);
> +	h->max_huge_pages = set_max_huge_pages(h, count, nid);
>  
> -	return count;
> +	return len;
>  }
>  HSTATE_ATTR(nr_hugepages);
>  
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
>  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
>  }
> +
>  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t count)
>  {
>  	int err;
>  	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  
>  	err = strict_strtoul(buf, 10, &input);
>  	if (err)
> @@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
>  static ssize_t free_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> +	struct hstate *h;
> +	unsigned long free_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		free_huge_pages = h->free_huge_pages;
> +	else
> +		free_huge_pages = h->free_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", free_huge_pages);
>  }
>  HSTATE_ATTR_RO(free_hugepages);
>  
>  static ssize_t resv_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
>  }
>  HSTATE_ATTR_RO(resv_hugepages);
> @@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages);
>  static ssize_t surplus_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> +	struct hstate *h;
> +	unsigned long surplus_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		surplus_huge_pages = h->surplus_huge_pages;
> +	else
> +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", surplus_huge_pages);
>  }
>  HSTATE_ATTR_RO(surplus_hugepages);
>  
> @@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att
>  	.attrs = hstate_attrs,
>  };
>  
> -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> +				struct kobject *parent,
> +				struct kobject **hstate_kobjs,
> +				struct attribute_group *hstate_attr_group)
>  {
>  	int retval;
> +	int hi = h - hstates;
>  
> -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> -							hugepages_kobj);
> -	if (!hstate_kobjs[h - hstates])
> +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> +	if (!hstate_kobjs[hi])
>  		return -ENOMEM;
>  
> -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> -							&hstate_attr_group);
> +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
>  	if (retval)
> -		kobject_put(hstate_kobjs[h - hstates]);
> +		kobject_put(hstate_kobjs[hi]);
>  
>  	return retval;
>  }
> @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
>  		return;
>  
>  	for_each_hstate(h) {
> -		err = hugetlb_sysfs_add_hstate(h);
> +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> +					 hstate_kobjs, &hstate_attr_group);
>  		if (err)
>  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
>  								h->name);
>  	}
>  }
>  
> +#ifdef CONFIG_NUMA
> +
> +struct node_hstate {
> +	struct kobject		*hugepages_kobj;
> +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> +};
> +struct node_hstate node_hstates[MAX_NUMNODES];
> +
> +static struct attribute *per_node_hstate_attrs[] = {
> +	&nr_hugepages_attr.attr,
> +	&free_hugepages_attr.attr,
> +	&surplus_hugepages_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group per_node_hstate_attr_group = {
> +	.attrs = per_node_hstate_attrs,
> +};
> +
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node_hstate *nhs = &node_hstates[nid];
> +		int i;
> +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> +			if (nhs->hstate_kobjs[i] == kobj) {
> +				if (nidp)
> +					*nidp = nid;
> +				return &hstates[i];
> +			}
> +	}
> +
> +	BUG();
> +	return NULL;
> +}
> +
> +void hugetlb_unregister_node(struct node *node)
> +{
> +	struct hstate *h;
> +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> +
> +	if (!nhs->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h)
> +		if (nhs->hstate_kobjs[h - hstates]) {
> +			kobject_put(nhs->hstate_kobjs[h - hstates]);
> +			nhs->hstate_kobjs[h - hstates] = NULL;
> +		}
> +
> +	kobject_put(nhs->hugepages_kobj);
> +	nhs->hugepages_kobj = NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		hugetlb_unregister_node(&node_devices[nid]);
> +
> +	register_hugetlbfs_with_node(NULL, NULL);
> +}
> +
> +void hugetlb_register_node(struct node *node)
> +{
> +	struct hstate *h;
> +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> +	int err;
> +
> +	if (nhs->hugepages_kobj)
> +		return;		/* already allocated */
> +
> +	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
> +							&node->sysdev.kobj);
> +	if (!nhs->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h) {
> +		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
> +						nhs->hstate_kobjs,
> +						&per_node_hstate_attr_group);
> +		if (err) {
> +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> +					" for node %d\n",
> +						h->name, node->sysdev.id);

Maybe add `err' to the printk so we know whether it was an -ENOMEM 
condition or sysfs problem?
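
Something like this, say:

	printk(KERN_ERR "Hugetlb: Unable to add hstate %s for node %d: "
			"error %d\n", h->name, node->sysdev.id, err);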

> +			hugetlb_unregister_node(node);
> +			break;
> +		}
> +	}
> +}
> +
> +static void hugetlb_register_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {

Don't you want to do this for all nodes in N_HIGH_MEMORY?  I don't think 
we should be adding attributes for memoryless nodes.
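
I.e., roughly (a sketch using the existing node state iterator; the 
sysdev.id check from the patch is elided):

	for_each_node_state(nid, N_HIGH_MEMORY)
		hugetlb_register_node(&node_devices[nid]);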

> +		struct node *node = &node_devices[nid];
> +		if (node->sysdev.id == nid)
> +			hugetlb_register_node(node);
> +	}
> +
> +	register_hugetlbfs_with_node(hugetlb_register_node,
> +                                     hugetlb_unregister_node);
> +}
> +#else	/* !CONFIG_NUMA */
> +
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	BUG();
> +	if (nidp)
> +		*nidp = -1;
> +	return NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void) { }
> +
> +static void hugetlb_register_all_nodes(void) { }
> +
> +#endif
> +
>  static void __exit hugetlb_exit(void)
>  {
>  	struct hstate *h;
>  
> +	hugetlb_unregister_all_nodes();
> +
>  	for_each_hstate(h) {
>  		kobject_put(hstate_kobjs[h - hstates]);
>  	}
> @@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void)
>  
>  	hugetlb_sysfs_init();
>  
> +	hugetlb_register_all_nodes();
> +
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	proc_doulongvec_minmax(table, write, buffer, length, ppos);
>  
>  	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> +		h->max_huge_pages = set_max_huge_pages(h, tmp,
> +		                                       NO_NODEID_SPECIFIED);
>  
>  	return 0;
>  }
> Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/numa.h	2009-08-28 09:21:17.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h	2009-08-28 09:21:31.000000000 -0400
> @@ -10,4 +10,6 @@
>  
>  #define MAX_NUMNODES    (1 << NODES_SHIFT)
>  
> +#define NO_NODEID_SPECIFIED	(-1)
> +
>  #endif /* _LINUX_NUMA_H */

Hmm, so we already have NUMA_NO_NODE in the ia64 and x86_64 code and 
NID_INVAL in the ACPI code, both of which are defined to -1.  Maybe rename 
your addition here in favor of NUMA_NO_NODE, remove it from the ia64 and 
x86 arch headers, and convert the ACPI code?
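
That is:

	/* include/linux/numa.h -- sketch of the consolidated definition */
	#define NUMA_NO_NODE	(-1)

with the duplicate ia64/x86 definitions removed and ACPI's NID_INVAL 
users converted over.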

Thanks for doing this!


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-03 20:07     ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 20:07 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney,
	Randy Dunlap

On Fri, 28 Aug 2009, Lee Schermerhorn wrote:

> [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
> 
> Against: 2.6.31-rc7-mmotm-090827-0057
> 
> V2:  Add brief description of per node attributes.
> 
> This patch updates the kernel huge tlb documentation to describe the
> numa memory policy based huge page management.  Additionally, the patch
> includes a fair amount of rework to improve consistency, eliminate
> duplication and set the context for documenting the memory policy
> interaction.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Adding Randy to the cc.  Comments below, but otherwise:

Acked-by: David Rientjes <rientjes@google.com>

> 
>  Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
>  1 file changed, 172 insertions(+), 85 deletions(-)
> 
> Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:16.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:32.000000000 -0400
> @@ -11,23 +11,21 @@ This optimization is more critical now a
>  (several GBs) are more readily available.
>  
>  Users can use the huge page support in Linux kernel by either using the mmap
> -system call or standard SYSv shared memory system calls (shmget, shmat).
> +system call or standard SYSV shared memory system calls (shmget, shmat).
>  
>  First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
>  (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
>  automatically when CONFIG_HUGETLBFS is selected) configuration
>  options.
>  
> -The kernel built with huge page support should show the number of configured
> -huge pages in the system by running the "cat /proc/meminfo" command.
> +The /proc/meminfo file provides information about the total number of hugetlb
> +pages preallocated in the kernel's huge page pool.  It also displays
> +information about the number of free, reserved and surplus huge pages and the
> +[default] huge page size.  The huge page size is needed for generating the

Don't think the brackets are needed.

> +proper alignment and size of the arguments to system calls that map huge page
> +regions.
>  
> -/proc/meminfo also provides information about the total number of hugetlb
> -pages configured in the kernel.  It also displays information about the
> -number of free hugetlb pages at any time.  It also displays information about
> -the configured huge page size - this is needed for generating the proper
> -alignment and size of the arguments to the above system calls.
> -
> -The output of "cat /proc/meminfo" will have lines like:
> +The output of "cat /proc/meminfo" will include lines like:
>  
>  .....
>  HugePages_Total: vvv
> @@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
>  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
>  in the kernel.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> -pages in the kernel.  Super user can dynamically request more (or free some
> -pre-configured) huge pages.
> -The allocation (or deallocation) of hugetlb pages is possible only if there are
> -enough physically contiguous free pages in system (freeing of huge pages is
> -possible only if there are enough hugetlb pages free that can be transferred
> -back to regular memory pool).
> -
> -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> -be used for other purposes.
> -
> -Once the kernel with Hugetlb page support is built and running, a user can
> -use either the mmap system call or shared memory system calls to start using
> -the huge pages.  It is required that the system administrator preallocate
> -enough memory for huge page purposes.
> -
> -The administrator can preallocate huge pages on the kernel boot command line by
> -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> -requested.  This is the most reliable method for preallocating huge pages as
> -memory has not yet become fragmented.
> +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> +allocated in the kernel's huge page pool.  These are called "persistent"
> +huge pages.  A user with root privileges can dynamically allocate more or
> +free some persistent huge pages by increasing or decreasing the value of
> +'nr_hugepages'.
> +

So they're not necessarily "preallocated" then if they're already in use.

> +Pages that are used as huge pages are reserved inside the kernel and cannot
> +be used for other purposes.  Huge pages can not be swapped out under
> +memory pressure.
> +
> +Once a number of huge pages have been pre-allocated to the kernel huge page
> +pool, a user with appropriate privilege can use either the mmap system call
> +or shared memory system calls to use the huge pages.  See the discussion of
> +Using Huge Pages, below.
> +
> +The administrator can preallocate persistent huge pages on the kernel boot
> +command line by specifying the "hugepages=N" parameter, where 'N' = the
> +number of huge pages requested.  This is the most reliable method for
> +preallocating huge pages, as memory has not yet become fragmented.
>  
>  Some platforms support multiple huge page sizes.  To preallocate huge pages
>  of a specific size, one must preceed the huge pages boot command parameters
> @@ -80,19 +77,24 @@ with a huge page size selection paramete
>  be specified in bytes with optional scale suffix [kKmMgG].  The default huge
>  page size may be selected with the "default_hugepagesz=<size>" boot parameter.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> -size] hugetlb pages in the kernel.  Super user can dynamically request more
> -(or free some pre-configured) huge pages.
> -
> -Use the following command to dynamically allocate/deallocate default sized
> -huge pages:
> +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> +indicates the current number of pre-allocated huge pages of the default size.
> +Thus, one can use the following command to dynamically allocate/deallocate
> +default sized persistent huge pages:
>  
>  	echo 20 > /proc/sys/vm/nr_hugepages
>  
> -This command will try to configure 20 default sized huge pages in the system.
> +This command will try to adjust the number of default sized huge pages in the
> +huge page pool to 20, allocating or freeing huge pages, as required.
> +
>  On a NUMA platform, the kernel will attempt to distribute the huge page pool
> -over the all on-line nodes.  These huge pages, allocated when nr_hugepages
> -is increased, are called "persistent huge pages".
> +over the all the nodes specified by the NUMA memory policy of the task that

Remove the first 'the'.

> +modifies nr_hugepages that contain sufficient available contiguous memory.
> +These nodes are called the huge pages "allowed nodes".  The default for the

Not sure if you need to spell out that they're called "huge page allowed 
nodes," isn't that an implementation detail?  The way Paul Jackson used to 
describe nodes_allowed is "set of allowable nodes," and I can't think of a 
better phrase.  That's also how the cpuset documentation describes them.

> +huge pages allowed nodes--when the task has default memory policy--is all
> +on-line nodes.  See the discussion below of the interaction of task memory

All online nodes with memory, right?

> +policy, cpusets and per node attributes with the allocation and freeing of
> +persistent huge pages.
>  
>  The success or failure of huge page allocation depends on the amount of
>  physically contiguous memory that is preset in system at the time of the
> @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
>  allocating extra pages on other nodes with sufficient available contiguous
>  memory, if any.
>  
> -System administrators may want to put this command in one of the local rc init
> -files.  This will enable the kernel to request huge pages early in the boot
> -process when the possibility of getting physical contiguous pages is still
> -very high.  Administrators can verify the number of huge pages actually
> -allocated by checking the sysctl or meminfo.  To check the per node
> +System administrators may want to put this command in one of the local rc
> +init files.  This will enable the kernel to preallocate huge pages early in
> +the boot process when the possibility of getting physical contiguous pages
> +is still very high.  Administrators can verify the number of huge pages
> +actually allocated by checking the sysctl or meminfo.  To check the per node
>  distribution of huge pages in a NUMA system, use:
>  
>  	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
> @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
>  /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
>  huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
>  requested by applications.  Writing any non-zero value into this file
> -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> -huge pages from the buddy allocator, when the normal pool is exhausted. As
> -these surplus huge pages go out of use, they are freed back to the buddy
> -allocator.
> +indicates that the hugetlb subsystem is allowed to try to obtain that
> +number of "surplus" huge pages from the kernel's normal page pool, when the
> +persistent huge page pool is exhausted. As these surplus huge pages become
> +unused, they are freed back to the kernel's normal page pool.
>  
> -When increasing the huge page pool size via nr_hugepages, any surplus
> +When increasing the huge page pool size via nr_hugepages, any existing surplus
>  pages will first be promoted to persistent huge pages.  Then, additional
>  huge pages will be allocated, if necessary and if possible, to fulfill
> -the new huge page pool size.
> +the new persistent huge page pool size.
>  
>  The administrator may shrink the pool of preallocated huge pages for
>  the default huge page size by setting the nr_hugepages sysctl to a
>  smaller value.  The kernel will attempt to balance the freeing of huge pages
> -across all on-line nodes.  Any free huge pages on the selected nodes will
> -be freed back to the buddy allocator.
> -
> -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> -than the number of huge pages in use will convert the balance to surplus
> -huge pages even if it would exceed the overcommit value.  As long as
> -this condition holds, however, no more surplus huge pages will be
> -allowed on the system until one of the two sysctls are increased
> -sufficiently, or the surplus huge pages go out of use and are freed.
> +across all nodes in the memory policy of the task modifying nr_hugepages.
> +Any free huge pages on the selected nodes will be freed back to the kernel's
> +normal page pool.
> +
> +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> +it becomes less than the number of huge pages in use will convert the balance
> +of the in-use huge pages to surplus huge pages.  This will occur even if
> +the number of surplus pages it would exceed the overcommit value.  As long as
> +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> +increased sufficiently, or the surplus huge pages go out of use and are freed--
> +no more surplus huge pages will be allowed to be allocated.
>  

Nice description!

>  With support for multiple huge page pools at run-time available, much of
> -the huge page userspace interface has been duplicated in sysfs. The above
> -information applies to the default huge page size which will be
> -controlled by the /proc interfaces for backwards compatibility. The root
> -huge page control directory in sysfs is:
> +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> +The /proc interfaces discussed above have been retained for backwards
> +compatibility. The root huge page control directory in sysfs is:
>  
>  	/sys/kernel/mm/hugepages
>  
>  For each huge page size supported by the running kernel, a subdirectory
> -will exist, of the form
> +will exist, of the form:
>  
>  	hugepages-${size}kB
>  
> @@ -159,6 +162,98 @@ Inside each of these directories, the sa
>  
>  which function as described above for the default huge page-sized case.
>  
> +
> +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> +
> +Whether huge pages are allocated and freed via the /proc interface or
> +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> +or freed are controlled by the NUMA memory policy of the task that modifies
> +the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> +
> +The recommended method to allocate or free huge pages to/from the kernel
> +huge page pool, using the nr_hugepages example above, is:
> +
> +    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
> +
> +or, more succinctly:
> +
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
> +
> +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> +specified in <node-list>, depending on whether nr_hugepages is initially
> +less than or greater than 20, respectively.  No huge pages will be
> +allocated nor freed on any node not included in the specified <node-list>.
> +

This is actually why I was against the mempolicy approach to begin with: 
applications currently can free all hugepages on the system simply by 
writing to nr_hugepages, regardless of their mempolicy.  It's now possible 
that hugepages will remain allocated because they are on nodes disjoint 
from current->mempolicy->v.nodes.  I hope the advantages of this approach 
outweigh the potential userspace breakage of existing applications.

> +Any memory policy mode--bind, preferred, local or interleave--may be
> +used.  The effect on persistent huge page allocation will be as follows:
> +
> +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> +   persistent huge pages will be distributed across the node or nodes
> +   specified in the mempolicy as if "interleave" had been specified.
> +   However, if a node in the policy does not contain sufficient contiguous
> +   memory for a huge page, the allocation will not "fallback" to the nearest
> +   neighbor node with sufficient contiguous memory.  To do this would cause
> +   undesirable imbalance in the distribution of the huge page pool, or
> +   possibly, allocation of persistent huge pages on nodes not allowed by
> +   the task's memory policy.
> +

This is a good example of why the per-node tunables are helpful in case 
such a fallback is desired.

> +2) One or more nodes may be specified with the bind or interleave policy.
> +   If more than one node is specified with the preferred policy, only the
> +   lowest numeric id will be used.  Local policy will select the node where
> +   the task is running at the time the nodes_allowed mask is constructed.
> +
> +3) For local policy to be deterministic, the task must be bound to a cpu or
> +   cpus in a single node.  Otherwise, the task could be migrated to some
> +   other node at any time after launch and the resulting node will be
> +   indeterminate.  Thus, local policy is not very useful for this purpose.
> +   Any of the other mempolicy modes may be used to specify a single node.
> +
> +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +   whether this policy was set explicitly by the task itself or one of its
> +   ancestors, such as numactl.  This means that if the task is invoked from a
> +   shell with non-default policy, that policy will be used.  One can specify a
> +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> +   interleaving over all nodes in the system or cpuset.
> +

Nice description.

> +5) Any task mempolicy specified--e.g., using numactl--will be constrained by
> +   the resource limits of any cpuset in which the task runs.  Thus, there will
> +   be no way for a task with non-default policy running in a cpuset with a
> +   subset of the system nodes to allocate huge pages outside the cpuset
> +   without first moving to a cpuset that contains all of the desired nodes.
> +
> +6) Hugepages allocated at boot time always use the node_online_map.

Implementation detail in the name, maybe just say "all online nodes with 
memory"?

> +
> +
> +Per Node Hugepages Attributes
> +
> +A subset of the contents of the root huge page control directory in sysfs,
> +described above, has been replicated under each "node" system device in:
> +
> +	/sys/devices/system/node/node[0-9]*/hugepages/
> +
> +Under this directory, the subdirectory for each supported huge page size
> +contains the following attribute files:
> +
> +	nr_hugepages
> +	free_hugepages
> +	surplus_hugepages
> +
> +The free_ and surplus_ attribute files are read-only.  They return the number
> +of free and surplus [overcommitted] huge pages, respectively, on the parent
> +node.
> +
> +The nr_hugepages attribute will return the total number of huge pages on the
> +specified node.  When this attribute is written, the number of persistent huge
> +pages on the parent node will be adjusted to the specified value, if sufficient
> +resources exist, regardless of the task's mempolicy or cpuset constraints.
> +
> +Note that the numbers of overcommit and reserve pages remain global quantities,
> +as we don't know until fault time, when the faulting task's mempolicy is applied,
> +from which node the huge page allocation will be attempted.
> +
> +
> +Using Huge Pages:
> +
>  If the user applications are going to request huge pages using mmap system
>  call, then it is required that system administrator mount a file system of
>  type hugetlbfs:
> @@ -206,9 +301,11 @@ map_hugetlb.c.
>   * requesting huge pages.
>   *
>   * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> - * huge pages.  That means the addresses starting with 0x800000... will need
> - * to be specified.  Specifying a fixed address is not required on ppc64,
> - * i386 or x86_64.
> + * huge pages.  That means that if one requires a fixed address, a huge page
> + * aligned address starting with 0x800000... will be required.  If a fixed
> + * address is not required, the kernel will select an address in the proper
> + * range.
> + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
>   *
>   * Note: The default shared memory limit is quite low on many kernels,
>   * you may need to increase it via:
> @@ -237,14 +334,8 @@ map_hugetlb.c.
>  
>  #define dprintf(x)  printf(x)
>  
> -/* Only ia64 requires this */
> -#ifdef __ia64__
> -#define ADDR (void *)(0x8000000000000000UL)
> -#define SHMAT_FLAGS (SHM_RND)
> -#else
> -#define ADDR (void *)(0x0UL)
> +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
>  #define SHMAT_FLAGS (0)
> -#endif
>  
>  int main(void)
>  {
> @@ -302,10 +393,12 @@ int main(void)
>   * example, the app is requesting memory of size 256MB that is backed by
>   * huge pages.
>   *
> - * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
> - * That means the addresses starting with 0x800000... will need to be
> - * specified.  Specifying a fixed address is not required on ppc64, i386
> - * or x86_64.
> + * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> + * huge pages.  That means that if one requires a fixed address, a huge page
> + * aligned address starting with 0x800000... will be required.  If a fixed
> + * address is not required, the kernel will select an address in the proper
> + * range.
> + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
>   */
>  #include <stdlib.h>
>  #include <stdio.h>
> @@ -317,14 +410,8 @@ int main(void)
>  #define LENGTH (256UL*1024*1024)
>  #define PROTECTION (PROT_READ | PROT_WRITE)
>  
> -/* Only ia64 requires this */
> -#ifdef __ia64__
> -#define ADDR (void *)(0x8000000000000000UL)
> -#define FLAGS (MAP_SHARED | MAP_FIXED)
> -#else
> -#define ADDR (void *)(0x0UL)
> +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
>  #define FLAGS (MAP_SHARED)
> -#endif
>  
>  void check_bytes(char *addr)
>  {
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-09-03 19:22     ` David Rientjes
@ 2009-09-03 20:15       ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-03 20:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-09-03 at 12:22 -0700, David Rientjes wrote:
> On Fri, 28 Aug 2009, Lee Schermerhorn wrote:
> 
> > Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c
> > ===================================================================
> > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/mempolicy.c	2009-08-28 09:21:20.000000000 -0400
> > +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/mempolicy.c	2009-08-28 09:21:28.000000000 -0400
> > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
> >  	}
> >  	return zl;
> >  }
> > +
> > +/*
> > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> > + *
> > + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> > + * to constrain the allocation and freeing of persistent huge pages.
> > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> > + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> > + * page will never "fallback" to another node inside the buddy system
> > + * allocator.
> > + *
> > + * If the task's mempolicy is "default" [NULL], just return NULL for
> > + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> > + * or 'interleave' policy or construct a nodemask for 'preferred' or
> > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> > + *
> > + * N.B., it is the caller's responsibility to free a returned nodemask.
> > + */
> 
> This isn't limited to only hugepage code, so a more appropriate name would 
> probably be better.

Currently, this function is limited to hugepage code.  Most [all?]
other users of mempolicy just use alloc_vma_pages() and company, w/o
cracking open the mempolicy.   I suppose something might come along
that wants to open-code interleaving over this mask, the way the
hugepage code does.  We could generalize it then.  However, I'm not
opposed to changing it to something like
"alloc_nodemask_of_mempolicy()".   I still want to keep it in
mempolicy.c, tho'.

Would this work for you?
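
FWIW, the renamed function would be the same code behind a more generic
name.  A rough, untested sketch--assuming the current mpol->mode/mpol->v
layout and hand-waving past the task locking:

	nodemask_t *alloc_nodemask_of_mempolicy(void)
	{
		struct mempolicy *mpol = current->mempolicy;
		nodemask_t *nodes_allowed;

		if (!mpol)
			return NULL;	/* default policy:  no mask */

		nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
		if (!nodes_allowed)
			return NULL;

		switch (mpol->mode) {
		case MPOL_BIND:
		case MPOL_INTERLEAVE:
			*nodes_allowed = mpol->v.nodes;
			break;
		case MPOL_PREFERRED:
			nodes_clear(*nodes_allowed);
			if (mpol->flags & MPOL_F_LOCAL)
				node_set(numa_node_id(), *nodes_allowed);
			else
				node_set(mpol->v.preferred_node,
						*nodes_allowed);
			break;
		default:
			BUG();
		}
		return nodes_allowed;	/* caller must kfree() */
	}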
  
> 
> It'd probably be better to check for a NULL nodes_allowed either in 
> set_max_huge_pages() than in hstate_next_node_to_{alloc,free} just for the 
> cleanliness of the code OR simply return node_online_map from this 
> function for default policies.

Yeah, I could pull the test up to right after we check for a node id
or task policy, and assign a pointer to node_online_map to
nodes_allowed.  Then, I'll have to test for that condition before
calling kfree().  I have no strong feelings about this.   I'll try to
get this done for V6.  I'd like to get that out this week.
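
I.e., something like this untested fragment in set_max_huge_pages(),
so the rest of the function never sees a NULL mask:

	nodes_allowed = huge_mpol_nodes_allowed();
	if (!nodes_allowed)
		nodes_allowed = &node_online_map;	/* default policy */

	/* ... pool adjustment unchanged ... */

	if (nodes_allowed != &node_online_map)
		kfree(nodes_allowed);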

> 
> Otherwise
> 
> Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 5/6] hugetlb:  add per node hstate attributes
  2009-09-03 19:52     ` David Rientjes
@ 2009-09-03 20:41       ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-03 20:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-09-03 at 12:52 -0700, David Rientjes wrote:
> On Fri, 28 Aug 2009, Lee Schermerhorn wrote:
> 
> > Index: linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/mm/hugetlb.c	2009-08-28 09:21:28.000000000 -0400
> > +++ linux-2.6.31-rc7-mmotm-090827-0057/mm/hugetlb.c	2009-08-28 09:21:31.000000000 -0400
> > @@ -24,6 +24,7 @@
> >  #include <asm/io.h>
> >  
> >  #include <linux/hugetlb.h>
> > +#include <linux/node.h>
> >  #include "internal.h"
> >  
> >  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> > @@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs
> >  }
> >  
> >  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > +								int nid)
> >  {
> >  	unsigned long min_count, ret;
> >  	nodemask_t *nodes_allowed;
> > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
> >  	if (h->order >= MAX_ORDER)
> >  		return h->max_huge_pages;
> >  
> > -	nodes_allowed = huge_mpol_nodes_allowed();
> > +	if (nid == NO_NODEID_SPECIFIED)
> > +		nodes_allowed = huge_mpol_nodes_allowed();
> > +	else {
> > +		/*
> > +		 * incoming 'count' is for node 'nid' only, so
> > +		 * adjust count to global, but restrict alloc/free
> > +		 * to the specified node.
> > +		 */
> > +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> > +		nodes_allowed = alloc_nodemask_of_node(nid);
> > +		if (!nodes_allowed)
> > +			printk(KERN_WARNING "%s unable to allocate allowed "
> > +			       "nodes mask for huge page allocation/free.  "
> > +			       "Falling back to default.\n", current->comm);
> > +	}
> >  
> >  	/*
> >  	 * Increase the pool size
> > @@ -1329,51 +1345,71 @@ out:
> >  static struct kobject *hugepages_kobj;
> >  static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
> >  
> > -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
> > +
> > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
> >  {
> >  	int i;
> > +
> >  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > -		if (hstate_kobjs[i] == kobj)
> > +		if (hstate_kobjs[i] == kobj) {
> > +			if (nidp)
> > +				*nidp = NO_NODEID_SPECIFIED;
> >  			return &hstates[i];
> > -	BUG();
> > -	return NULL;
> > +		}
> > +
> > +	return kobj_to_node_hstate(kobj, nidp);
> >  }
> >  
> >  static ssize_t nr_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> > +	struct hstate *h;
> > +	unsigned long nr_huge_pages;
> > +	int nid;
> > +
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	if (nid == NO_NODEID_SPECIFIED)
> > +		nr_huge_pages = h->nr_huge_pages;
> > +	else
> > +		nr_huge_pages = h->nr_huge_pages_node[nid];
> > +
> > +	return sprintf(buf, "%lu\n", nr_huge_pages);
> >  }
> > +
> >  static ssize_t nr_hugepages_store(struct kobject *kobj,
> > -		struct kobj_attribute *attr, const char *buf, size_t count)
> > +		struct kobj_attribute *attr, const char *buf, size_t len)
> >  {
> > +	unsigned long count;
> > +	struct hstate *h;
> > +	int nid;
> >  	int err;
> > -	unsigned long input;
> > -	struct hstate *h = kobj_to_hstate(kobj);
> >  
> > -	err = strict_strtoul(buf, 10, &input);
> > +	err = strict_strtoul(buf, 10, &count);
> >  	if (err)
> >  		return 0;
> >  
> > -	h->max_huge_pages = set_max_huge_pages(h, input);
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	h->max_huge_pages = set_max_huge_pages(h, count, nid);
> >  
> > -	return count;
> > +	return len;
> >  }
> >  HSTATE_ATTR(nr_hugepages);
> >  
> >  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> > +
> >  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
> >  }
> > +
> >  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
> >  		struct kobj_attribute *attr, const char *buf, size_t count)
> >  {
> >  	int err;
> >  	unsigned long input;
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> >  
> >  	err = strict_strtoul(buf, 10, &input);
> >  	if (err)
> > @@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
> >  static ssize_t free_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> > +	struct hstate *h;
> > +	unsigned long free_huge_pages;
> > +	int nid;
> > +
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	if (nid == NO_NODEID_SPECIFIED)
> > +		free_huge_pages = h->free_huge_pages;
> > +	else
> > +		free_huge_pages = h->free_huge_pages_node[nid];
> > +
> > +	return sprintf(buf, "%lu\n", free_huge_pages);
> >  }
> >  HSTATE_ATTR_RO(free_hugepages);
> >  
> >  static ssize_t resv_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> >  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
> >  }
> >  HSTATE_ATTR_RO(resv_hugepages);
> > @@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages);
> >  static ssize_t surplus_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> > +	struct hstate *h;
> > +	unsigned long surplus_huge_pages;
> > +	int nid;
> > +
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	if (nid == NO_NODEID_SPECIFIED)
> > +		surplus_huge_pages = h->surplus_huge_pages;
> > +	else
> > +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> > +
> > +	return sprintf(buf, "%lu\n", surplus_huge_pages);
> >  }
> >  HSTATE_ATTR_RO(surplus_hugepages);
> >  
> > @@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att
> >  	.attrs = hstate_attrs,
> >  };
> >  
> > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> > +				struct kobject *parent,
> > +				struct kobject **hstate_kobjs,
> > +				struct attribute_group *hstate_attr_group)
> >  {
> >  	int retval;
> > +	int hi = h - hstates;
> >  
> > -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> > -							hugepages_kobj);
> > -	if (!hstate_kobjs[h - hstates])
> > +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> > +	if (!hstate_kobjs[hi])
> >  		return -ENOMEM;
> >  
> > -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> > -							&hstate_attr_group);
> > +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
> >  	if (retval)
> > -		kobject_put(hstate_kobjs[h - hstates]);
> > +		kobject_put(hstate_kobjs[hi]);
> >  
> >  	return retval;
> >  }
> > @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
> >  		return;
> >  
> >  	for_each_hstate(h) {
> > -		err = hugetlb_sysfs_add_hstate(h);
> > +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> > +					 hstate_kobjs, &hstate_attr_group);
> >  		if (err)
> >  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> >  								h->name);
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_NUMA
> > +
> > +struct node_hstate {
> > +	struct kobject		*hugepages_kobj;
> > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> > +};
> > +struct node_hstate node_hstates[MAX_NUMNODES];
> > +
> > +static struct attribute *per_node_hstate_attrs[] = {
> > +	&nr_hugepages_attr.attr,
> > +	&free_hugepages_attr.attr,
> > +	&surplus_hugepages_attr.attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group per_node_hstate_attr_group = {
> > +	.attrs = per_node_hstate_attrs,
> > +};
> > +
> > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		struct node_hstate *nhs = &node_hstates[nid];
> > +		int i;
> > +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > +			if (nhs->hstate_kobjs[i] == kobj) {
> > +				if (nidp)
> > +					*nidp = nid;
> > +				return &hstates[i];
> > +			}
> > +	}
> > +
> > +	BUG();
> > +	return NULL;
> > +}
> > +
> > +void hugetlb_unregister_node(struct node *node)
> > +{
> > +	struct hstate *h;
> > +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> > +
> > +	if (!nhs->hugepages_kobj)
> > +		return;
> > +
> > +	for_each_hstate(h)
> > +		if (nhs->hstate_kobjs[h - hstates]) {
> > +			kobject_put(nhs->hstate_kobjs[h - hstates]);
> > +			nhs->hstate_kobjs[h - hstates] = NULL;
> > +		}
> > +
> > +	kobject_put(nhs->hugepages_kobj);
> > +	nhs->hugepages_kobj = NULL;
> > +}
> > +
> > +static void hugetlb_unregister_all_nodes(void)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++)
> > +		hugetlb_unregister_node(&node_devices[nid]);
> > +
> > +	register_hugetlbfs_with_node(NULL, NULL);
> > +}
> > +
> > +void hugetlb_register_node(struct node *node)
> > +{
> > +	struct hstate *h;
> > +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> > +	int err;
> > +
> > +	if (nhs->hugepages_kobj)
> > +		return;		/* already allocated */
> > +
> > +	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
> > +							&node->sysdev.kobj);
> > +	if (!nhs->hugepages_kobj)
> > +		return;
> > +
> > +	for_each_hstate(h) {
> > +		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
> > +						nhs->hstate_kobjs,
> > +						&per_node_hstate_attr_group);
> > +		if (err) {
> > +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> > +					" for node %d\n",
> > +						h->name, node->sysdev.id);
> 
> Maybe add `err' to the printk so we know whether it was an -ENOMEM 
> condition or sysfs problem?

Just the raw negative number?
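
E.g., just tacking it onto the end of the existing message:

	printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
			" for node %d: err %d\n",
			h->name, node->sysdev.id, err);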

> 
> > +			hugetlb_unregister_node(node);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> > +static void hugetlb_register_all_nodes(void)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> 
> Don't you want to do this for all nodes in N_HIGH_MEMORY?  I don't think 
> we should be adding attributes for memoryless nodes.


Well, I wondered about that.  The persistent huge page allocation code
is careful to skip over nodes where it can't allocate a huge page,
whether or not the node has [any] memory.  So, it's safe to leave it
this way.  And, I was worried about the interaction with memory hotplug,
as separate from node hotplug.  The current code handles node hotplug,
but I wasn't sure about memory hotplug within a node.  I.e., would the
node driver get called to register the attributes in this case?   Maybe
that case doesn't exist, so I don't have to worry about it.   I think
this is somewhat similar to the top cpuset mems_allowed being set to all
possible to cover any subsequently added nodes/memory.

It's easy to change to use node_states[N_HIGH_MEMORY], but then the
memory hotplug guys might jump in and say that now it doesn't work for
memory hot add/remove.  I really don't want to go there...

But, I can understand the confusion about providing an explicit control
like this for a node without memory.  It is somewhat different from just
visiting all online nodes for the default mask, as hugetlb.c has always
done.
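
For reference, the N_HIGH_MEMORY variant of the registration loop
would just be (untested):

	for_each_node_state(nid, N_HIGH_MEMORY) {
		struct node *node = &node_devices[nid];
		if (node->sysdev.id == nid)
			hugetlb_register_node(node);
	}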
  
> 
> > +		struct node *node = &node_devices[nid];
> > +		if (node->sysdev.id == nid)
> > +			hugetlb_register_node(node);
> > +	}
> > +
> > +	register_hugetlbfs_with_node(hugetlb_register_node,
> > +                                     hugetlb_unregister_node);
> > +}
> > +#else	/* !CONFIG_NUMA */
> > +
> > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > +{
> > +	BUG();
> > +	if (nidp)
> > +		*nidp = -1;
> > +	return NULL;
> > +}
> > +
> > +static void hugetlb_unregister_all_nodes(void) { }
> > +
> > +static void hugetlb_register_all_nodes(void) { }
> > +
> > +#endif
> > +
> >  static void __exit hugetlb_exit(void)
> >  {
> >  	struct hstate *h;
> >  
> > +	hugetlb_unregister_all_nodes();
> > +
> >  	for_each_hstate(h) {
> >  		kobject_put(hstate_kobjs[h - hstates]);
> >  	}
> > @@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void)
> >  
> >  	hugetlb_sysfs_init();
> >  
> > +	hugetlb_register_all_nodes();
> > +
> >  	return 0;
> >  }
> >  module_init(hugetlb_init);
> > @@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
> >  	proc_doulongvec_minmax(table, write, buffer, length, ppos);
> >  
> >  	if (write)
> > -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> > +		h->max_huge_pages = set_max_huge_pages(h, tmp,
> > +		                                       NO_NODEID_SPECIFIED);
> >  
> >  	return 0;
> >  }
> > Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h
> > ===================================================================
> > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/numa.h	2009-08-28 09:21:17.000000000 -0400
> > +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h	2009-08-28 09:21:31.000000000 -0400
> > @@ -10,4 +10,6 @@
> >  
> >  #define MAX_NUMNODES    (1 << NODES_SHIFT)
> >  
> > +#define NO_NODEID_SPECIFIED	(-1)
> > +
> >  #endif /* _LINUX_NUMA_H */
> 
> Hmm, so we already have NUMA_NO_NODE in the ia64 and x86_64 code and 
> NID_INVAL in the ACPI code, both of which are defined to -1.  Maybe rename 
> your addition here in favor of NUMA_NO_NODE, remove it from the ia64 and 
> x86 arch headers, and convert the ACPI code?

OK, replacing 'NO_NODEID_SPECIFIED' with 'NUMA_NO_NODE' works w/o
descending into header dependency hell.  The symbol is already visible
in hugetlb.c.  I'll fix that.  But, ACPI?  Not today, thanks :).
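
I.e., numa.h would just pick up the generic definition that the arch
headers already carry:

	/* include/linux/numa.h */
	#define NUMA_NO_NODE	(-1)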

> 
> Thanks for doing this!

Well, we do need this, as well.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-08-28 16:03   ` Lee Schermerhorn
@ 2009-09-03 20:42   ` Randy Dunlap
  2009-09-04 15:23     ` Lee Schermerhorn
  -1 siblings, 1 reply; 81+ messages in thread
From: Randy Dunlap @ 2009-09-03 20:42 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

On Fri, 28 Aug 2009 12:03:51 -0400 Lee Schermerhorn wrote:

(Thanks for cc:, David.)


> [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
> 
> Against: 2.6.31-rc7-mmotm-090827-0057
> 
> V2:  Add brief description of per node attributes.
> 
> This patch updates the kernel huge tlb documentation to describe the
> numa memory policy based huge page management.  Additionaly, the patch
> includes a fair amount of rework to improve consistency, eliminate
> duplication and set the context for documenting the memory policy
> interaction.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
>  1 file changed, 172 insertions(+), 85 deletions(-)
> 
> Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:16.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:32.000000000 -0400

> @@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
>  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
>  in the kernel.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> -pages in the kernel.  Super user can dynamically request more (or free some
> -pre-configured) huge pages.
> -The allocation (or deallocation) of hugetlb pages is possible only if there are
> -enough physically contiguous free pages in system (freeing of huge pages is
> -possible only if there are enough hugetlb pages free that can be transferred
> -back to regular memory pool).
> -
> -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> -be used for other purposes.
> -
> -Once the kernel with Hugetlb page support is built and running, a user can
> -use either the mmap system call or shared memory system calls to start using
> -the huge pages.  It is required that the system administrator preallocate
> -enough memory for huge page purposes.
> -
> -The administrator can preallocate huge pages on the kernel boot command line by
> -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> -requested.  This is the most reliable method for preallocating huge pages as
> -memory has not yet become fragmented.
> +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> +allocated in the kernel's huge page pool.  These are called "persistent"
> +huge pages.  A user with root privileges can dynamically allocate more or
> +free some persistent huge pages by increasing or decreasing the value of
> +'nr_hugepages'.
> +
> +Pages that are used as huge pages are reserved inside the kernel and cannot
> +be used for other purposes.  Huge pages can not be swapped out under

                                           cannot

> +memory pressure.
> +
> +Once a number of huge pages have been pre-allocated to the kernel huge page
> +pool, a user with appropriate privilege can use either the mmap system call
> +or shared memory system calls to use the huge pages.  See the discussion of
> +Using Huge Pages, below

                     below.

> +
> +The administrator can preallocate persistent huge pages on the kernel boot
> +command line by specifying the "hugepages=N" parameter, where 'N' = the
> +number of requested huge pages requested.  This is the most reliable method

drop first "requested"

> +or preallocating huge pages as memory has not yet become fragmented.

   of

>  
>  Some platforms support multiple huge page sizes.  To preallocate huge pages
>  of a specific size, one must preceed the huge pages boot command parameters
> @@ -80,19 +77,24 @@ with a huge page size selection paramete
>  be specified in bytes with optional scale suffix [kKmMgG].  The default huge
>  page size may be selected with the "default_hugepagesz=<size>" boot parameter.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> -size] hugetlb pages in the kernel.  Super user can dynamically request more
> -(or free some pre-configured) huge pages.
> -
> -Use the following command to dynamically allocate/deallocate default sized
> -huge pages:
> +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> +indicates the current number of pre-allocated huge pages of the default size.
> +Thus, one can use the following command to dynamically allocate/deallocate
> +default sized persistent huge pages:
>  
>  	echo 20 > /proc/sys/vm/nr_hugepages
>  
> -This command will try to configure 20 default sized huge pages in the system.
> +This command will try to adjust the number of default sized huge pages in the
> +huge page pool to 20, allocating or freeing huge pages, as required.
> +
>  On a NUMA platform, the kernel will attempt to distribute the huge page pool
> -over the all on-line nodes.  These huge pages, allocated when nr_hugepages
> -is increased, are called "persistent huge pages".
> +over the all the nodes specified by the NUMA memory policy of the task that

drop first "the"

> +modifies nr_hugepages that contain sufficient available contiguous memory.

whoa.  too many "that"s.  confusing.


> +These nodes are called the huge pages "allowed nodes".  The default for the
> +huge pages allowed nodes--when the task has default memory policy--is all
> +on-line nodes.  See the discussion below of the interaction of task memory
> +policy, cpusets and per node attributes with the allocation and freeing of
> +persistent huge pages.
>  
>  The success or failure of huge page allocation depends on the amount of
>  physically contiguous memory that is preset in system at the time of the
> @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
...

> @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
>  /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
>  huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
>  requested by applications.  Writing any non-zero value into this file
> -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> -huge pages from the buddy allocator, when the normal pool is exhausted. As
> -these surplus huge pages go out of use, they are freed back to the buddy
> -allocator.
> +indicates that the hugetlb subsystem is allowed to try to obtain that
> +number of "surplus" huge pages from the kernel's normal page pool, when the
> +persistent huge page pool is exhausted. As these surplus huge pages become
> +unused, they are freed back to the kernel's normal page pool.
>  
> -When increasing the huge page pool size via nr_hugepages, any surplus
> +When increasing the huge page pool size via nr_hugepages, any existing surplus
>  pages will first be promoted to persistent huge pages.  Then, additional
>  huge pages will be allocated, if necessary and if possible, to fulfill
> -the new huge page pool size.
> +the new persistent huge page pool size.
>  
>  The administrator may shrink the pool of preallocated huge pages for
>  the default huge page size by setting the nr_hugepages sysctl to a
>  smaller value.  The kernel will attempt to balance the freeing of huge pages
> -across all on-line nodes.  Any free huge pages on the selected nodes will
> -be freed back to the buddy allocator.
> -
> -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> -than the number of huge pages in use will convert the balance to surplus
> -huge pages even if it would exceed the overcommit value.  As long as
> -this condition holds, however, no more surplus huge pages will be
> -allowed on the system until one of the two sysctls are increased
> -sufficiently, or the surplus huge pages go out of use and are freed.
> +across all nodes in the memory policy of the task modifying nr_hugepages.
> +Any free huge pages on the selected nodes will be freed back to the kernel's
> +normal page pool.
> +
> +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> +it becomes less than the number of huge pages in use will convert the balance
> +of the in-use huge pages to surplus huge pages.  This will occur even if

                               surplus allocated huge pages
? vs. surplus available huge pages?

surplus (to me) implies available/unallocated...

Reading more below, I see that "surplus" here means "overcommitted".  oh well ;)


> +the number of surplus pages it would exceed the overcommit value.  As long as
> +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> +increased sufficiently, or the surplus huge pages go out of use and are freed--
> +no more surplus huge pages will be allowed to be allocated.
>  
>  With support for multiple huge page pools at run-time available, much of
> -the huge page userspace interface has been duplicated in sysfs. The above
> -information applies to the default huge page size which will be
> -controlled by the /proc interfaces for backwards compatibility. The root
> -huge page control directory in sysfs is:
> +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> +The /proc interfaces discussed above have been retained for backwards
> +compatibility. The root huge page control directory in sysfs is:
>  
>  	/sys/kernel/mm/hugepages
>  
>  For each huge page size supported by the running kernel, a subdirectory
> -will exist, of the form
> +will exist, of the form:
>  
>  	hugepages-${size}kB
>  
> @@ -159,6 +162,98 @@ Inside each of these directories, the sa
>  
>  which function as described above for the default huge page-sized case.
>  
> +
> +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> +
> +Whether huge pages are allocated and freed via the /proc interface or
> +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> +or freed are controlled by the NUMA memory policy of the task that modifies
> +the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> +
> +The recommended method to allocate or free huge pages to/from the kernel
> +huge page pool, using the nr_hugepages example above, is:
> +
> +    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.

drop '.'

> +
> +or, more succinctly:
> +
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.

ditto


> +
> +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> +specified in <node-list>, depending on whether nr_hugepages is initially
> +less than or greater than 20, respectively.  No huge pages will be
> +allocated nor freed on any node not included in the specified <node-list>.
> +
> +Any memory policy mode--bind, preferred, local or interleave--may be
> +used.  The effect on persistent huge page allocation will be as follows:

I would just use present tense as much as possible, e.g.,
                                             allocation is as follows:

> +
> +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> +   persistent huge pages will be distributed across the node or nodes
> +   specified in the mempolicy as if "interleave" had been specified.
> +   However, if a node in the policy does not contain sufficient contiguous
> +   memory for a huge page, the allocation will not "fallback" to the nearest
> +   neighbor node with sufficient contiguous memory.  To do this would cause
> +   undesirable imbalance in the distribution of the huge page pool, or
> +   possibly, allocation of persistent huge pages on nodes not allowed by
> +   the task's memory policy.
> +
> +2) One or more nodes may be specified with the bind or interleave policy.
> +   If more than one node is specified with the preferred policy, only the
> +   lowest numeric id will be used.  Local policy will select the node where
> +   the task is running at the time the nodes_allowed mask is constructed.
> +
> +3) For local policy to be deterministic, the task must be bound to a cpu or
> +   cpus in a single node.  Otherwise, the task could be migrated to some

I prefer s/cpu/CPU/ in all of Documentation/ text, but the cat is already out
of the bag on that.

> +   other node at any time after launch and the resulting node will be
> +   indeterminate.  Thus, local policy is not very useful for this purpose.
> +   Any of the other mempolicy modes may be used to specify a single node.
> +
> +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +   whether this policy was set explicitly by the task itself or one of its
> +   ancestors, such as numactl.  This means that if the task is invoked from a
> +   shell with non-default policy, that policy will be used.  One can specify a
> +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> +   interleaving over all nodes in the system or cpuset.
> +
> +5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> +   the resource limits of any cpuset in which the task runs.  Thus, there will
> +   be no way for a task with non-default policy running in a cpuset with a
> +   subset of the system nodes to allocate huge pages outside the cpuset
> +   without first moving to a cpuset that contains all of the desired nodes.
> +
> +6) Hugepages allocated at boot time always use the node_online_map.
> +
> +
> +Per Node Hugepages Attributes
> +
> +A subset of the contents of the root huge page control directory in sysfs,
> +described above, has been replicated under each "node" system device in:
> +
> +	/sys/devices/system/node/node[0-9]*/hugepages/
> +
> +Under this directory, the subdirectory for each supported huge page size
> +contains the following attribute files:
> +
> +	nr_hugepages
> +	free_hugepages
> +	surplus_hugepages
> +
> +The free_' and surplus_' attribute files are read-only.  They return the number
> +of free and surplus [overcommitted] huge pages, respectively, on the parent
> +node.
> +
> +The nr_hugepages attribute will return the total number of huge pages on the
> +specified node.  When this attribute is written, the number of persistent huge
> +pages on the parent node will be adjusted to the specified value, if sufficient
> +resources exist, regardless of the task's mempolicy or cpuset constraints.
> +
> +Note that the number of overcommit and reserve pages remain global quantities,
> +as we don't know until fault time, when the faulting task's mempolicy is applied,
> +from which node the huge page allocation will be attempted.
> +
> +
> +Using Huge Pages:
> +
>  If the user applications are going to request huge pages using mmap system
>  call, then it is required that system administrator mount a file system of
>  type hugetlbfs:
> @@ -206,9 +301,11 @@ map_hugetlb.c.
...

> @@ -237,14 +334,8 @@ map_hugetlb.c.
...

> @@ -302,10 +393,12 @@ int main(void)
...

> @@ -317,14 +410,8 @@ int main(void)
...


---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-09-03 20:15       ` Lee Schermerhorn
@ 2009-09-03 20:49         ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 20:49 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 3 Sep 2009, Lee Schermerhorn wrote:

> > This isn't limited to only hugepage code, so a more appropriate name would 
> > probably be better.
> 
> Currently, this function is very much limited only to hugepage code.
> Most [all?] other users of mempolicy just use the alloc_vma_pages() and
> company, w/o cracking open the mempolicy.   I suppose something might
> come along that wants to open code interleaving over this mask, the way
> hugepage code does.  We could generalize it, then.  However, I'm not
> opposed to changing it to something like
> "alloc_nodemask_of_mempolicy()".   I still want to keep it in
> mempolicy.c, tho'.
> 
> Would this work for you?
>   

Yeah, it's not hugepage specific at all so mm/mempolicy.c is the only 
place for it anyway.  I just didn't think it needed `huge' in its name 
since it may get additional callers later.  alloc_nodemask_of_mempolicy() 
certainly sounds like a good generic function with a well defined purpose.
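
For concreteness, a rough sketch of how I'd picture it--purely
illustrative, with locking and the local policy case elided, so not
the final V6 code:

	/*
	 * Sketch only:  return a kmalloc()'d nodemask derived from the
	 * current task's mempolicy, or NULL for default policy [or on
	 * allocation failure].  Caller must kfree() a non-NULL result.
	 */
	nodemask_t *alloc_nodemask_of_mempolicy(void)
	{
		nodemask_t *nodes_allowed;
		struct mempolicy *mpol = current->mempolicy;

		if (!mpol)
			return NULL;	/* default policy */

		nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
		if (!nodes_allowed)
			return NULL;

		switch (mpol->mode) {
		case MPOL_PREFERRED:
			/* single node; local (preferred_node == -1) elided */
			init_nodemask_of_nodes(nodes_allowed,
						mpol->v.preferred_node);
			break;
		default:		/* MPOL_BIND, MPOL_INTERLEAVE */
			*nodes_allowed = mpol->v.nodes;
			break;
		}
		return nodes_allowed;
	}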

> > It'd probably be better to check for a NULL nodes_allowed either in 
> > set_max_huge_pages() than in hstate_next_node_to_{alloc,free} just for the 
> > cleanliness of the code OR simply return node_online_map from this 
> > function for default policies.
> 
> Yeah, I could pull the test up there to right after we check for a node
> id or task policy, and assign a pointer to node_online_map to
> nodes_allowed.  Then, I'll have to test for that condition before
> calling kfree().  I have no strong feelings about this.   I'll try to
> get this done for V6.  I'd like to get that out this week.
> 

&node_states[N_HIGH_MEMORY] as opposed to &node_online_map.  
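
I.e., at the top of set_max_huge_pages(), something along the lines
of (sketch):

	/* NULL nodes_allowed == default policy:  use all nodes w/ memory */
	if (!nodes_allowed)
		nodes_allowed = &node_states[N_HIGH_MEMORY];

with the kfree() then done only when nodes_allowed was actually
kmalloc()'d.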


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] hugetlb:  introduce alloc_nodemask_of_node
  2009-09-03 18:34         ` David Rientjes
  (?)
@ 2009-09-03 20:49         ` Lee Schermerhorn
  2009-09-03 21:03             ` David Rientjes
  -1 siblings, 1 reply; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-03 20:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-mm, akpm, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-09-03 at 11:34 -0700, David Rientjes wrote:
> On Tue, 1 Sep 2009, Lee Schermerhorn wrote:
> 
> > > > Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h
> > > > ===================================================================
> > > > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/nodemask.h	2009-08-28 09:21:19.000000000 -0400
> > > > +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/nodemask.h	2009-08-28 09:21:29.000000000 -0400
> > > > @@ -245,18 +245,34 @@ static inline int __next_node(int n, con
> > > >  	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
> > > >  }
> > > >  
> > > > +#define init_nodemask_of_nodes(mask, node)				\
> > > > +	nodes_clear(*(mask));						\
> > > > +	node_set((node), *(mask));
> > > > +
> > > 
> > > Is the done thing to either make this a static inline or else wrap it in
> > > a do { } while(0) ? The reasoning being that if this is used as part of an
> > > another statement (e.g. a for loop) that it'll actually compile instead of
> > > throw up weird error messages.
> > 
> > Right.  I'll fix this [and signoff/review orders] next time [maybe last
> > time?].  It occurs to me that I can also use this for
> > huge_mpol_nodes_allowed(), so I'll move it up in the series and fix that
> > [which you've already ack'd].  I'll wait a bit to hear from David before
> > I respin.
> > 
> 
> I think it should be an inline function just so there's typechecking on 
> the first argument passed in (and so alloc_nodemask_of_node() below 
> doesn't get a NULL pointer dereference on node_set() if nmp can't be 
> allocated).

OK.  That works.  Will be in V6.
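
I.e., roughly:

	static inline void init_nodemask_of_nodes(nodemask_t *mask, int node)
	{
		nodes_clear(*mask);
		node_set(node, *mask);
	}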

> 
> I've seen the issue about the signed-off-by/reviewed-by/acked-by order 
> come up before.  I've always put my signed-off-by line last whenever 
> proposing patches because it shows a clear order in who gathered those 
> lines when submitting to -mm, for example.  If I write
> 
> 	Cc: Mel Gorman <mel@csn.ul.ie>
> 	Signed-off-by: David Rientjes <rientjes@google.com>
> 
> it is clear that I cc'd Mel on the initial proposal.  If it is the other 
> way around, for example,
> 
> 	Signed-off-by: David Rientjes <rientjes@google.com>
> 	Cc: Mel Gorman <mel@csn.ul.ie>
> 	Signed-off-by: Andrew Morton...
> 
> then it indicates Andrew added the cc when merging into -mm.  That's more 
> relevant when such a line is acked-by or reviewed-by since it is now 
> possible to determine who received such acknowledgement from the 
> individual and is responsible for correctly relaying it in the patch 
> submission.
> 
> If it's done this way, it indicates that whoever is signing off the patch 
> is responsible for everything above it.  The type of line (signed-off-by, 
> reviewed-by, acked-by) is enough of an indication about the development 
> history of the patch, I believe, and it doesn't require specific ordering 
> to communicate (and the first line having to be a signed-off-by line isn't 
> really important, it doesn't replace the From: line).
> 
> It also appears to be how both Linus merges his own patches with Cc's.

???

> 
> > > > +/*
> > > > + * returns pointer to kmalloc()'d nodemask initialized to contain the
> > > > + * specified node.  Caller must free with kfree().
> > > > + */
> > > > +#define alloc_nodemask_of_node(node)					\
> > > > +({									\
> > > > +	typeof(_unused_nodemask_arg_) *nmp;				\
> > > > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> > > > +	if (nmp)							\
> > > > +		init_nodemask_of_nodes(nmp, (node));			\
> > > > +	nmp;								\
> > > > +})
> > > > +
> > > 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 5/6] hugetlb:  add per node hstate attributes
  2009-09-03 20:41       ` Lee Schermerhorn
@ 2009-09-03 21:02         ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 21:02 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 3 Sep 2009, Lee Schermerhorn wrote:

> > > @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
> > >  		return;
> > >  
> > >  	for_each_hstate(h) {
> > > -		err = hugetlb_sysfs_add_hstate(h);
> > > +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> > > +					 hstate_kobjs, &hstate_attr_group);
> > >  		if (err)
> > >  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> > >  								h->name);
> > >  	}
> > >  }
> > >  
> > > +#ifdef CONFIG_NUMA
> > > +
> > > +struct node_hstate {
> > > +	struct kobject		*hugepages_kobj;
> > > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> > > +};
> > > +struct node_hstate node_hstates[MAX_NUMNODES];
> > > +
> > > +static struct attribute *per_node_hstate_attrs[] = {
> > > +	&nr_hugepages_attr.attr,
> > > +	&free_hugepages_attr.attr,
> > > +	&surplus_hugepages_attr.attr,
> > > +	NULL,
> > > +};
> > > +
> > > +static struct attribute_group per_node_hstate_attr_group = {
> > > +	.attrs = per_node_hstate_attrs,
> > > +};
> > > +
> > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > > +{
> > > +	int nid;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > +		struct node_hstate *nhs = &node_hstates[nid];
> > > +		int i;
> > > +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > > +			if (nhs->hstate_kobjs[i] == kobj) {
> > > +				if (nidp)
> > > +					*nidp = nid;
> > > +				return &hstates[i];
> > > +			}
> > > +	}
> > > +
> > > +	BUG();
> > > +	return NULL;
> > > +}
> > > +
> > > +void hugetlb_unregister_node(struct node *node)
> > > +{
> > > +	struct hstate *h;
> > > +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> > > +
> > > +	if (!nhs->hugepages_kobj)
> > > +		return;
> > > +
> > > +	for_each_hstate(h)
> > > +		if (nhs->hstate_kobjs[h - hstates]) {
> > > +			kobject_put(nhs->hstate_kobjs[h - hstates]);
> > > +			nhs->hstate_kobjs[h - hstates] = NULL;
> > > +		}
> > > +
> > > +	kobject_put(nhs->hugepages_kobj);
> > > +	nhs->hugepages_kobj = NULL;
> > > +}
> > > +
> > > +static void hugetlb_unregister_all_nodes(void)
> > > +{
> > > +	int nid;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++)
> > > +		hugetlb_unregister_node(&node_devices[nid]);
> > > +
> > > +	register_hugetlbfs_with_node(NULL, NULL);
> > > +}
> > > +
> > > +void hugetlb_register_node(struct node *node)
> > > +{
> > > +	struct hstate *h;
> > > +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> > > +	int err;
> > > +
> > > +	if (nhs->hugepages_kobj)
> > > +		return;		/* already allocated */
> > > +
> > > +	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
> > > +							&node->sysdev.kobj);
> > > +	if (!nhs->hugepages_kobj)
> > > +		return;
> > > +
> > > +	for_each_hstate(h) {
> > > +		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
> > > +						nhs->hstate_kobjs,
> > > +						&per_node_hstate_attr_group);
> > > +		if (err) {
> > > +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> > > +					" for node %d\n",
> > > +						h->name, node->sysdev.id);
> > 
> > Maybe add `err' to the printk so we know whether it was an -ENOMEM 
> > condition or sysfs problem?
> 
> Just the raw negative number?
> 

Sure.  I'm making the assumption that the printk is actually necessary in 
the first place, which is rather unorthodox for functions that can 
otherwise silently recover by unregistering the attribute.  Using the 
printk implies you want to know about the failure, yet additional 
debugging would be necessary to even identify what the failure was without 
printing the errno.
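
E.g., something like:

			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
					" for node %d: err %d\n",
					h->name, node->sysdev.id, err);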

> > 
> > > +			hugetlb_unregister_node(node);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static void hugetlb_register_all_nodes(void)
> > > +{
> > > +	int nid;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > 
> > Don't you want to do this for all nodes in N_HIGH_MEMORY?  I don't think 
> > we should be adding attributes for memoryless nodes.
> 
> 
> Well, I wondered about that.  The persistent huge page allocation code
> is careful to skip over nodes where it can't allow a huge page there,
> whether or not the node has [any] memory.  So, it's safe to leave it
> this way.

It's safe, but it seems inconsistent to allow hugepage attributes to 
appear in /sys/devices/node/node* for nodes that have no memory.

> And, I was worried about the interaction with memory hotplug,
> as separate from node hotplug.  The current code handles node hot plug,
> but I wasn't sure about memory hot-plug within a node.

If memory hotplug doesn't update node_states[N_HIGH_MEMORY] then there are 
plenty of other users that will currently fail as well.

> I.e., would the
> node driver get called to register the attributes in this case?

Not without a MEM_ONLINE notifier.

> Maybe
> that case doesn't exist, so I don't have to worry about it.   I think
> this is somewhat similar to the top cpuset mems_allowed being set to all
> possible to cover any subsequently added nodes/memory.
> 

That's because the page allocator's zonelists won't try to allocate from a 
memoryless node and the only hook into the cpuset code in that path is to 
check whether a nid is set in cpuset_current_mems_allowed.  It's quite 
different from providing per-node allocation and freeing mechanisms for 
pages on nodes without memory like this approach.
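
For the registration loop itself, that would mean something like
(sketch; remainder of the function as in your patch):

	static void hugetlb_register_all_nodes(void)
	{
		int nid;

		/* only register nodes that currently have memory */
		for_each_node_state(nid, N_HIGH_MEMORY)
			hugetlb_register_node(&node_devices[nid]);
	}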

> > > Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h
> > > ===================================================================
> > > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/numa.h	2009-08-28 09:21:17.000000000 -0400
> > > +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h	2009-08-28 09:21:31.000000000 -0400
> > > @@ -10,4 +10,6 @@
> > >  
> > >  #define MAX_NUMNODES    (1 << NODES_SHIFT)
> > >  
> > > +#define NO_NODEID_SPECIFIED	(-1)
> > > +
> > >  #endif /* _LINUX_NUMA_H */
> > 
> > Hmm, so we already have NUMA_NO_NODE in the ia64 and x86_64 code and 
> > NID_INVAL in the ACPI code, both of which are defined to -1.  Maybe rename 
> > your addition here in favor of NUMA_NO_NODE, remove it from the ia64 and 
> > x86 arch headers, and convert the ACPI code?
> 
> OK, replacing 'NO_NODEID_SPECIFIED' with 'NUMA_NO_NODE' works w/o
> descending into header dependency hell.  The symbol is already visible in
> hugetlb.c.  I'll fix that.

NUMA_NO_NODE may be visible in hugetlb.c for ia64 and x86, but probably 
not for other architectures so it should be moved to include/linux/numa.h.

> But, ACPI?  Not today, thanks :).

Darn :)


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] hugetlb:  introduce alloc_nodemask_of_node
  2009-09-03 20:49         ` Lee Schermerhorn
@ 2009-09-03 21:03             ` David Rientjes
  0 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-03 21:03 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, akpm, Nishanth Aravamudan, linux-numa,
	Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 3 Sep 2009, Lee Schermerhorn wrote:

> > I've seen the issue about the signed-off-by/reviewed-by/acked-by order 
> > come up before.  I've always put my signed-off-by line last whenever 
> > proposing patches because it shows a clear order in who gathered those 
> > lines when submitting to -mm, for example.  If I write
> > 
> > 	Cc: Mel Gorman <mel@csn.ul.ie>
> > 	Signed-off-by: David Rientjes <rientjes@google.com>
> > 
> > it is clear that I cc'd Mel on the initial proposal.  If it is the other 
> > way around, for example,
> > 
> > 	Signed-off-by: David Rientjes <rientjes@google.com>
> > 	Cc: Mel Gorman <mel@csn.ul.ie>
> > 	Signed-off-by: Andrew Morton...
> > 
> > then it indicates Andrew added the cc when merging into -mm.  That's more 
> > relevant when such a line is acked-by or reviewed-by since it is now 
> > possible to determine who received such acknowledgement from the 
> > individual and is responsible for correctly relaying it in the patch 
> > submission.
> > 
> > If it's done this way, it indicates that whoever is signing off the patch 
> > is responsible for everything above it.  The type of line (signed-off-by, 
> > reviewed-by, acked-by) is enough of an indication about the development 
> > history of the patch, I believe, and it doesn't require specific ordering 
> > to communicate (and the first line having to be a signed-off-by line isn't 
> > really important, it doesn't replace the From: line).
> > 
> > It also appears to be how both Linus merges his own patches with Cc's.
> 
> ???
> 

Not sure what's confusing about this, sorry.  You order your 
acked-by/reviewed-by/signed-off-by lines just like I have for years and I 
don't think it needs to be changed.  It shows a clear history of who did 
what in the path from original developer -> maintainer -> Linus.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-03 20:07     ` David Rientjes
@ 2009-09-03 21:09       ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-03 21:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney,
	Randy Dunlap

On Thu, 2009-09-03 at 13:07 -0700, David Rientjes wrote:
> On Fri, 28 Aug 2009, Lee Schermerhorn wrote:
> 
> > [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
> > 
> > Against: 2.6.31-rc7-mmotm-090827-0057
> > 
> > V2:  Add brief description of per node attributes.
> > 
> > This patch updates the kernel huge tlb documentation to describe the
> > numa memory policy based huge page management.  Additionaly, the patch
> > includes a fair amount of rework to improve consistency, eliminate
> > duplication and set the context for documenting the memory policy
> > interaction.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> Adding Randy to the cc.  Comments below, but otherwise:
> 
> Acked-by: David Rientjes <rientjes@google.com>
> 
> > 
> >  Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
> >  1 file changed, 172 insertions(+), 85 deletions(-)
> > 
> > Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> > ===================================================================
> > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:16.000000000 -0400
> > +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt	2009-08-28 09:21:32.000000000 -0400
> > @@ -11,23 +11,21 @@ This optimization is more critical now a
> >  (several GBs) are more readily available.
> >  
> >  Users can use the huge page support in Linux kernel by either using the mmap
> > -system call or standard SYSv shared memory system calls (shmget, shmat).
> > +system call or standard SYSV shared memory system calls (shmget, shmat).
> >  
> >  First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
> >  (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
> >  automatically when CONFIG_HUGETLBFS is selected) configuration
> >  options.
> >  
> > -The kernel built with huge page support should show the number of configured
> > -huge pages in the system by running the "cat /proc/meminfo" command.
> > +The /proc/meminfo file provides information about the total number of hugetlb
> > +pages preallocated in the kernel's huge page pool.  It also displays
> > +information about the number of free, reserved and surplus huge pages and the
> > +[default] huge page size.  The huge page size is needed for generating the
> 
> Don't think the brackets are needed.

will fix

> 
> > +proper alignment and size of the arguments to system calls that map huge page
> > +regions.
> >  
> > -/proc/meminfo also provides information about the total number of hugetlb
> > -pages configured in the kernel.  It also displays information about the
> > -number of free hugetlb pages at any time.  It also displays information about
> > -the configured huge page size - this is needed for generating the proper
> > -alignment and size of the arguments to the above system calls.
> > -
> > -The output of "cat /proc/meminfo" will have lines like:
> > +The output of "cat /proc/meminfo" will include lines like:
> >  
> >  .....
> >  HugePages_Total: vvv
> > @@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
> >  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
> >  in the kernel.
> >  
> > -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> > -pages in the kernel.  Super user can dynamically request more (or free some
> > -pre-configured) huge pages.
> > -The allocation (or deallocation) of hugetlb pages is possible only if there are
> > -enough physically contiguous free pages in system (freeing of huge pages is
> > -possible only if there are enough hugetlb pages free that can be transferred
> > -back to regular memory pool).
> > -
> > -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> > -be used for other purposes.
> > -
> > -Once the kernel with Hugetlb page support is built and running, a user can
> > -use either the mmap system call or shared memory system calls to start using
> > -the huge pages.  It is required that the system administrator preallocate
> > -enough memory for huge page purposes.
> > -
> > -The administrator can preallocate huge pages on the kernel boot command line by
> > -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> > -requested.  This is the most reliable method for preallocating huge pages as
> > -memory has not yet become fragmented.
> > +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> > +allocated in the kernel's huge page pool.  These are called "persistent"
> > +huge pages.  A user with root privileges can dynamically allocate more or
> > +free some persistent huge pages by increasing or decreasing the value of
> > +'nr_hugepages'.
> > +
> 
> So they're not necessarily "preallocated" then if they're already in use.

I don't see what in the text you're referring to:  "preallocated" vs.
"already in use"???

> 
> > +Pages that are used as huge pages are reserved inside the kernel and cannot
> > +be used for other purposes.  Huge pages can not be swapped out under
> > +memory pressure.
> > +
> > +Once a number of huge pages have been pre-allocated to the kernel huge page
> > +pool, a user with appropriate privilege can use either the mmap system call
> > +or shared memory system calls to use the huge pages.  See the discussion of
> > +Using Huge Pages, below
> > +
> > +The administrator can preallocate persistent huge pages on the kernel boot
> > +command line by specifying the "hugepages=N" parameter, where 'N' = the
> > +number of requested huge pages requested.  This is the most reliable method
> > +or preallocating huge pages as memory has not yet become fragmented.
> >  
> >  Some platforms support multiple huge page sizes.  To preallocate huge pages
> >  of a specific size, one must preceed the huge pages boot command parameters
> > @@ -80,19 +77,24 @@ with a huge page size selection paramete
> >  be specified in bytes with optional scale suffix [kKmMgG].  The default huge
> >  page size may be selected with the "default_hugepagesz=<size>" boot parameter.
> >  
> > -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> > -size] hugetlb pages in the kernel.  Super user can dynamically request more
> > -(or free some pre-configured) huge pages.
> > -
> > -Use the following command to dynamically allocate/deallocate default sized
> > -huge pages:
> > +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> > +indicates the current number of pre-allocated huge pages of the default size.
> > +Thus, one can use the following command to dynamically allocate/deallocate
> > +default sized persistent huge pages:
> >  
> >  	echo 20 > /proc/sys/vm/nr_hugepages
> >  
> > -This command will try to configure 20 default sized huge pages in the system.
> > +This command will try to adjust the number of default sized huge pages in the
> > +huge page pool to 20, allocating or freeing huge pages, as required.
> > +
> >  On a NUMA platform, the kernel will attempt to distribute the huge page pool
> > -over the all on-line nodes.  These huge pages, allocated when nr_hugepages
> > -is increased, are called "persistent huge pages".
> > +over the all the nodes specified by the NUMA memory policy of the task that
> 
> Remove the first 'the'.

OK.

> 
> > +modifies nr_hugepages that contain sufficient available contiguous memory.
> > +These nodes are called the huge pages "allowed nodes".  The default for the
> 
> Not sure if you need to spell out that they're called "huge page allowed 
> nodes," isn't that an implementation detail?  The way Paul Jackson used to 
> describe nodes_allowed is "set of allowable nodes," and I can't think of a 
> better phrase.  That's also how the cpuset documentation describes them.

I wanted to refer to "huge pages allowed nodes" to differentiate from,
e.g., cpuset "mems_allowed"--i.e., I wanted the "huge pages" qualifier.
I suppose I could introduce the phrase you suggest:  "set of allowable
nodes" and emphasize that in this doc, it only refers to nodes from
which persistent huge pages will be allocated.

> 
> > +huge pages allowed nodes--when the task has default memory policy--is all
> > +on-line nodes.  See the discussion below of the interaction of task memory
> 
> All online nodes with memory, right?

See response to comment on patch 5/6.  We can only allocate huge pages
from nodes that have them available, but the current code [before these
patches] does visit all on-line nodes.  As I mentioned, changing this
could have hotplug {imp|comp}lications, and for this patch set, I don't
want to go there.

> 
> > +policy, cpusets and per node attributes with the allocation and freeing of
> > +persistent huge pages.
> >  
> >  The success or failure of huge page allocation depends on the amount of
> >  physically contiguous memory that is preset in system at the time of the
> > @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
> >  allocating extra pages on other nodes with sufficient available contiguous
> >  memory, if any.
> >  
> > -System administrators may want to put this command in one of the local rc init
> > -files.  This will enable the kernel to request huge pages early in the boot
> > -process when the possibility of getting physical contiguous pages is still
> > -very high.  Administrators can verify the number of huge pages actually
> > -allocated by checking the sysctl or meminfo.  To check the per node
> > +System administrators may want to put this command in one of the local rc
> > +init files.  This will enable the kernel to preallocate huge pages early in
> > +the boot process when the possibility of getting physical contiguous pages
> > +is still very high.  Administrators can verify the number of huge pages
> > +actually allocated by checking the sysctl or meminfo.  To check the per node
> >  distribution of huge pages in a NUMA system, use:
> >  
> >  	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
> > @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
> >  /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
> >  huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
> >  requested by applications.  Writing any non-zero value into this file
> > -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> > -huge pages from the buddy allocator, when the normal pool is exhausted. As
> > -these surplus huge pages go out of use, they are freed back to the buddy
> > -allocator.
> > +indicates that the hugetlb subsystem is allowed to try to obtain that
> > +number of "surplus" huge pages from the kernel's normal page pool, when the
> > +persistent huge page pool is exhausted. As these surplus huge pages become
> > +unused, they are freed back to the kernel's normal page pool.
> >  
> > -When increasing the huge page pool size via nr_hugepages, any surplus
> > +When increasing the huge page pool size via nr_hugepages, any existing surplus
> >  pages will first be promoted to persistent huge pages.  Then, additional
> >  huge pages will be allocated, if necessary and if possible, to fulfill
> > -the new huge page pool size.
> > +the new persistent huge page pool size.
> >  
> >  The administrator may shrink the pool of preallocated huge pages for
> >  the default huge page size by setting the nr_hugepages sysctl to a
> >  smaller value.  The kernel will attempt to balance the freeing of huge pages
> > -across all on-line nodes.  Any free huge pages on the selected nodes will
> > -be freed back to the buddy allocator.
> > -
> > -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> > -than the number of huge pages in use will convert the balance to surplus
> > -huge pages even if it would exceed the overcommit value.  As long as
> > -this condition holds, however, no more surplus huge pages will be
> > -allowed on the system until one of the two sysctls are increased
> > -sufficiently, or the surplus huge pages go out of use and are freed.
> > +across all nodes in the memory policy of the task modifying nr_hugepages.
> > +Any free huge pages on the selected nodes will be freed back to the kernel's
> > +normal page pool.
> > +
> > +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> > +it becomes less than the number of huge pages in use will convert the balance
> > +of the in-use huge pages to surplus huge pages.  This will occur even if
> > +the number of surplus pages it would exceed the overcommit value.  As long as
> > +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> > +increased sufficiently, or the surplus huge pages go out of use and are freed--
> > +no more surplus huge pages will be allowed to be allocated.
> >  
> 
> Nice description!
> 
> >  With support for multiple huge page pools at run-time available, much of
> > -the huge page userspace interface has been duplicated in sysfs. The above
> > -information applies to the default huge page size which will be
> > -controlled by the /proc interfaces for backwards compatibility. The root
> > -huge page control directory in sysfs is:
> > +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> > +The /proc interfaces discussed above have been retained for backwards
> > +compatibility. The root huge page control directory in sysfs is:
> >  
> >  	/sys/kernel/mm/hugepages
> >  
> >  For each huge page size supported by the running kernel, a subdirectory
> > -will exist, of the form
> > +will exist, of the form:
> >  
> >  	hugepages-${size}kB
> >  
> > @@ -159,6 +162,98 @@ Inside each of these directories, the sa
> >  
> >  which function as described above for the default huge page-sized case.
> >  
> > +
> > +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> > +
> > +Whether huge pages are allocated and freed via the /proc interface or
> > +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> > +or freed are controlled by the NUMA memory policy of the task that modifies
> > +the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> > +
> > +The recommended method to allocate or free huge pages to/from the kernel
> > +huge page pool, using the nr_hugepages example above, is:
> > +
> > +    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
> > +
> > +or, more succinctly:
> > +
> > +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
> > +
> > +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> > +specified in <node-list>, depending on whether nr_hugepages is initially
> > +less than or greater than 20, respectively.  No huge pages will be
> > +allocated nor freed on any node not included in the specified <node-list>.
> > +
> 
> This is actually why I was against the mempolicy approach to begin with: 
> applications currently can free all hugepages on the system simply by 
> writing to nr_hugepages, regardless of their mempolicy.  It's now possible 
> that hugepages will remain allocated because they are on nodes disjoint 
> from current->mempolicy->v.nodes.  I hope the advantages of this approach 
> outweigh the potential userspace breakage of existing applications.

I understand.  However, I do think it's useful to support both a mask
[and Mel prefers it be based on mempolicy] and per node attributes.  On
some of our platforms, we do want explicit control over the placement of
huge pages--e.g., for a data base shared area or such.  So, we can say,
"I need <N> huge pages, and I want them on nodes 1, 3, 4 and 5", and
then, assuming we start with no huge pages allocated [free them all if
this is not the case]:

	numactl -m 1,3-5 hugeadm --pool-pages-min 2M:<N>

Later, if I decide that maybe I want to adjust the number on node 1, I
can:

	numactl -m 1 hugeadm --pool-pages-min 2M:{+|-}<count>

or:

	echo <new-value> >/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

[Of course, I'd probably do this in a script to avoid all that typing :)]
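
Something like this minimal sketch, say [assuming a 2MB default huge
page size; the attribute path follows patch 5/6 and the node ids and
counts are purely illustrative]:

	#!/bin/sh
	# set the persistent 2MB huge page count on each listed node
	# usage: node_hugepages.sh <count> <node-id>...
	count=$1; shift
	for nid in "$@"; do
		echo $count >/sys/devices/system/node/node${nid}/hugepages/hugepages-2048kB/nr_hugepages
	done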

> > +Any memory policy mode--bind, preferred, local or interleave--may be
> > +used.  The effect on persistent huge page allocation will be as follows:
> > +
> > +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> > +   persistent huge pages will be distributed across the node or nodes
> > +   specified in the mempolicy as if "interleave" had been specified.
> > +   However, if a node in the policy does not contain sufficient contiguous
> > +   memory for a huge page, the allocation will not "fallback" to the nearest
> > +   neighbor node with sufficient contiguous memory.  To do this would cause
> > +   undesirable imbalance in the distribution of the huge page pool, or
> > +   possibly, allocation of persistent huge pages on nodes not allowed by
> > +   the task's memory policy.
> > +
> 
> This is a good example of why the per-node tunables are helpful in case 
> such a fallback is desired.

Agreed.  And the fact that they do bypass any mempolicy.
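
E.g. [hypothetical node ids; 2MB default huge page size], even a task
whose mempolicy is restricted to node 0 can still place pages on node 2
via the per node attribute:

	numactl -m 0 sh -c \
	    'echo 8 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages'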

> 
> > +2) One or more nodes may be specified with the bind or interleave policy.
> > +   If more than one node is specified with the preferred policy, only the
> > +   lowest numeric id will be used.  Local policy will select the node where
> > +   the task is running at the time the nodes_allowed mask is constructed.
> > +
> > +3) For local policy to be deterministic, the task must be bound to a cpu or
> > +   cpus in a single node.  Otherwise, the task could be migrated to some
> > +   other node at any time after launch and the resulting node will be
> > +   indeterminate.  Thus, local policy is not very useful for this purpose.
> > +   Any of the other mempolicy modes may be used to specify a single node.
> > +
> > +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> > +   whether this policy was set explicitly by the task itself or one of its
> > +   ancestors, such as numactl.  This means that if the task is invoked from a
> > +   shell with non-default policy, that policy will be used.  One can specify a
> > +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> > +   interleaving over all nodes in the system or cpuset.
> > +
> 
> Nice description.
> 
> > +5) Any task mempolicy specified--e.g., using numactl--will be constrained by
> > +   the resource limits of any cpuset in which the task runs.  Thus, there will
> > +   be no way for a task with non-default policy running in a cpuset with a
> > +   subset of the system nodes to allocate huge pages outside the cpuset
> > +   without first moving to a cpuset that contains all of the desired nodes.
> > +
> > +6) Hugepages allocated at boot time always use the node_online_map.
> 
> Implementation detail in the name, maybe just say "all online nodes with 
> memory"?

OK.  will fix for V6.  soon come, I hope.

> 
> > +
> > +
> > +Per Node Hugepages Attributes
> > +
> > +A subset of the contents of the root huge page control directory in sysfs,
> > +described above, has been replicated under each "node" system device in:
> > +
> > +	/sys/devices/system/node/node[0-9]*/hugepages/
> > +
> > +Under this directory, the subdirectory for each supported huge page size
> > +contains the following attribute files:
> > +
> > +	nr_hugepages
> > +	free_hugepages
> > +	surplus_hugepages
> > +
> > +The free_ and surplus_ attribute files are read-only.  They return the number
> > +of free and surplus [overcommitted] huge pages, respectively, on the parent
> > +node.
> > +
> > +The nr_hugepages attribute will return the total number of huge pages on the
> > +specified node.  When this attribute is written, the number of persistent huge
> > +pages on the parent node will be adjusted to the specified value, if sufficient
> > +resources exist, regardless of the task's mempolicy or cpuset constraints.
> > +
> > +Note that the numbers of overcommit and reserve pages remain global quantities,
> > +as we don't know until fault time, when the faulting task's mempolicy is applied,
> > +from which node the huge page allocation will be attempted.
> > +
> > +
> > +Using Huge Pages:
> > +
> >  If the user applications are going to request huge pages using mmap system
> >  call, then it is required that system administrator mount a file system of
> >  type hugetlbfs:
> > @@ -206,9 +301,11 @@ map_hugetlb.c.
> >   * requesting huge pages.
> >   *
> >   * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> > - * huge pages.  That means the addresses starting with 0x800000... will need
> > - * to be specified.  Specifying a fixed address is not required on ppc64,
> > - * i386 or x86_64.
> > + * huge pages.  That means that if one requires a fixed address, a huge page
> > + * aligned address starting with 0x800000... will be required.  If a fixed
> > + * address is not required, the kernel will select an address in the proper
> > + * range.
> > + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
> >   *
> >   * Note: The default shared memory limit is quite low on many kernels,
> >   * you may need to increase it via:
> > @@ -237,14 +334,8 @@ map_hugetlb.c.
> >  
> >  #define dprintf(x)  printf(x)
> >  
> > -/* Only ia64 requires this */
> > -#ifdef __ia64__
> > -#define ADDR (void *)(0x8000000000000000UL)
> > -#define SHMAT_FLAGS (SHM_RND)
> > -#else
> > -#define ADDR (void *)(0x0UL)
> > +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
> >  #define SHMAT_FLAGS (0)
> > -#endif
> >  
> >  int main(void)
> >  {
> > @@ -302,10 +393,12 @@ int main(void)
> >   * example, the app is requesting memory of size 256MB that is backed by
> >   * huge pages.
> >   *
> > - * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
> > - * That means the addresses starting with 0x800000... will need to be
> > - * specified.  Specifying a fixed address is not required on ppc64, i386
> > - * or x86_64.
> > + * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> > + * huge pages.  That means that if one requires a fixed address, a huge page
> > + * aligned address starting with 0x800000... will be required.  If a fixed
> > + * address is not required, the kernel will select an address in the proper
> > + * range.
> > + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
> >   */
> >  #include <stdlib.h>
> >  #include <stdio.h>
> > @@ -317,14 +410,8 @@ int main(void)
> >  #define LENGTH (256UL*1024*1024)
> >  #define PROTECTION (PROT_READ | PROT_WRITE)
> >  
> > -/* Only ia64 requires this */
> > -#ifdef __ia64__
> > -#define ADDR (void *)(0x8000000000000000UL)
> > -#define FLAGS (MAP_SHARED | MAP_FIXED)
> > -#else
> > -#define ADDR (void *)(0x0UL)
> > +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
> >  #define FLAGS (MAP_SHARED)
> > -#endif
> >  
> >  void check_bytes(char *addr)
> >  {
> > 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-03 21:09       ` Lee Schermerhorn
@ 2009-09-03 21:25       ` David Rientjes
  2009-09-08 10:44           ` Mel Gorman
  -1 siblings, 1 reply; 81+ messages in thread
From: David Rientjes @ 2009-09-03 21:25 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney,
	Randy Dunlap

On Thu, 3 Sep 2009, Lee Schermerhorn wrote:

> > > @@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
> > >  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
> > >  in the kernel.
> > >  
> > > -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> > > -pages in the kernel.  Super user can dynamically request more (or free some
> > > -pre-configured) huge pages.
> > > -The allocation (or deallocation) of hugetlb pages is possible only if there are
> > > -enough physically contiguous free pages in system (freeing of huge pages is
> > > -possible only if there are enough hugetlb pages free that can be transferred
> > > -back to regular memory pool).
> > > -
> > > -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> > > -be used for other purposes.
> > > -
> > > -Once the kernel with Hugetlb page support is built and running, a user can
> > > -use either the mmap system call or shared memory system calls to start using
> > > -the huge pages.  It is required that the system administrator preallocate
> > > -enough memory for huge page purposes.
> > > -
> > > -The administrator can preallocate huge pages on the kernel boot command line by
> > > -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> > > -requested.  This is the most reliable method for preallocating huge pages as
> > > -memory has not yet become fragmented.
> > > +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> > > +allocated in the kernel's huge page pool.  These are called "persistent"
> > > +huge pages.  A user with root privileges can dynamically allocate more or
> > > +free some persistent huge pages by increasing or decreasing the value of
> > > +'nr_hugepages'.
> > > +
> > 
> > So they're not necessarily "preallocated" then if they're already in use.
> 
> I don't see what in the text you're referring to:  "preallocated" vs
> "already in use" ???
> 

Your new line, "/proc/sys/vm/nr_hugepages indicates the current number of 
huge pages preallocated in the kernel's huge page pool" doesn't seem 
correct since pages are not "pre"-allocated if they are used by an 
application.  Preallocation is only when pages are allocated for a 
performance optimization in a later hotpath (such as in a slab allocator) 
or when the allocation cannot be done later in a non-blocking context.  If 
you were to remove "pre" from that line it would be clear.

> > Not sure if you need to spell out that they're called "huge page allowed 
> > nodes," isn't that an implementation detail?  The way Paul Jackson used to 
> > describe nodes_allowed is "set of allowable nodes," and I can't think of a 
> > better phrase.  That's also how the cpuset documentation describes them.
> 
> I wanted to refer to "huge pages allowed nodes" to differentiate from,
> e.g., cpusets mems_allowed"--i.e., I wanted the "huge pages" qualifier.
> I suppose I could introduce the phrase you suggest:  "set of allowable
> nodes" and emphasize that in this doc, it only refers to nodes from
> which persistent huge pages will be allocated.
> 

It's a different story if you want to use the phrase "allowed nodes" 
throughout this document to mean "the set of allowed nodes from which to 
allocate hugepages depending on the allocating task's mempolicy," but I 
didn't see any future reference to that phrase in your changes anyway.

> I understand.  However, I do think it's useful to support both a mask
> [and Mel prefers it be based on mempolicy] and per node attributes.  On
> some of our platforms, we do want explicit control over the placement of
> huge pages--e.g., for a database shared area or such.  So, we can say,
> "I need <N> huge pages, and I want them on nodes 1, 3, 4 and 5", and
> then, assuming we start with no huge pages allocated [free them all if
> this is not the case]:
> 
> 	numactl -m 1,3-5 hugeadm --pool-pages-min 2M:<N>
> 
> Later, if I decide that maybe I want to adjust the number on node 1, I
> can:
> 
> 	numactl -m 1 hugeadm --pool-pages-min 2M:{+|-}<count>
> 
> or:
> 
> 	echo <new-value> >/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> 
> [Of course, I'd probably do this in a script to avoid all that typing :)]
> 

Yes, but the caveat I'm pointing out (and is really clearly described in 
your documentation changes here) is that existing applications, shell 
scripts, job schedulers, whatever, which currently free all system 
hugepages (or do so at a consistent interval down to the surplus 
value to reclaim memory) will now leak disjoint pages since the freeing is 
now governed by its mempolicy.  If the benefits of doing this 
significantly outweigh that potential for userspace breakage, I have no 
objection to it.  I just can't say for certain that it is.
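
A hypothetical illustration [2MB default pages, 20 pages currently
spread across nodes 0-3]:

	numactl -m 0 sh -c 'echo 0 >/proc/sys/vm/nr_hugepages'
	# only node 0's pages are freed; nodes 1-3 still hold theirs:
	grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages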

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 5/6] hugetlb:  add per node hstate attributes
  2009-09-03 21:02         ` David Rientjes
@ 2009-09-04 14:30           ` Lee Schermerhorn
  -1 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-04 14:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Andrew Morton, Mel Gorman, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-09-03 at 14:02 -0700, David Rientjes wrote:
> On Thu, 3 Sep 2009, Lee Schermerhorn wrote:
> 
> > > > @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
> > > >  		return;
> > > >  
> > > >  	for_each_hstate(h) {
> > > > -		err = hugetlb_sysfs_add_hstate(h);
> > > > +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> > > > +					 hstate_kobjs, &hstate_attr_group);
> > > >  		if (err)
> > > >  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> > > >  								h->name);
> > > >  	}
> > > >  }
> > > >  
> > > > +#ifdef CONFIG_NUMA
> > > > +
> > > > +struct node_hstate {
> > > > +	struct kobject		*hugepages_kobj;
> > > > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> > > > +};
> > > > +struct node_hstate node_hstates[MAX_NUMNODES];
> > > > +
> > > > +static struct attribute *per_node_hstate_attrs[] = {
> > > > +	&nr_hugepages_attr.attr,
> > > > +	&free_hugepages_attr.attr,
> > > > +	&surplus_hugepages_attr.attr,
> > > > +	NULL,
> > > > +};
> > > > +
> > > > +static struct attribute_group per_node_hstate_attr_group = {
> > > > +	.attrs = per_node_hstate_attrs,
> > > > +};
> > > > +
> > > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > > > +{
> > > > +	int nid;
> > > > +
> > > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > > +		struct node_hstate *nhs = &node_hstates[nid];
> > > > +		int i;
> > > > +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > > > +			if (nhs->hstate_kobjs[i] == kobj) {
> > > > +				if (nidp)
> > > > +					*nidp = nid;
> > > > +				return &hstates[i];
> > > > +			}
> > > > +	}
> > > > +
> > > > +	BUG();
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +void hugetlb_unregister_node(struct node *node)
> > > > +{
> > > > +	struct hstate *h;
> > > > +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> > > > +
> > > > +	if (!nhs->hugepages_kobj)
> > > > +		return;
> > > > +
> > > > +	for_each_hstate(h)
> > > > +		if (nhs->hstate_kobjs[h - hstates]) {
> > > > +			kobject_put(nhs->hstate_kobjs[h - hstates]);
> > > > +			nhs->hstate_kobjs[h - hstates] = NULL;
> > > > +		}
> > > > +
> > > > +	kobject_put(nhs->hugepages_kobj);
> > > > +	nhs->hugepages_kobj = NULL;
> > > > +}
> > > > +
> > > > +static void hugetlb_unregister_all_nodes(void)
> > > > +{
> > > > +	int nid;
> > > > +
> > > > +	for (nid = 0; nid < nr_node_ids; nid++)
> > > > +		hugetlb_unregister_node(&node_devices[nid]);
> > > > +
> > > > +	register_hugetlbfs_with_node(NULL, NULL);
> > > > +}
> > > > +
> > > > +void hugetlb_register_node(struct node *node)
> > > > +{
> > > > +	struct hstate *h;
> > > > +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> > > > +	int err;
> > > > +
> > > > +	if (nhs->hugepages_kobj)
> > > > +		return;		/* already allocated */
> > > > +
> > > > +	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
> > > > +							&node->sysdev.kobj);
> > > > +	if (!nhs->hugepages_kobj)
> > > > +		return;
> > > > +
> > > > +	for_each_hstate(h) {
> > > > +		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
> > > > +						nhs->hstate_kobjs,
> > > > +						&per_node_hstate_attr_group);
> > > > +		if (err) {
> > > > +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> > > > +					" for node %d\n",
> > > > +						h->name, node->sysdev.id);
> > > 
> > > Maybe add `err' to the printk so we know whether it was an -ENOMEM 
> > > condition or sysfs problem?
> > 
> > Just the raw negative number?
> > 
> 
> Sure.  I'm making the assumption that the printk is actually necessary in 
> the first place, which is rather unorthodox for functions that can 
> otherwise silently recover by unregistering the attribute.  Using the 
> printk implies you want to know about the failure, yet additional 
> debugging would be necessary to even identify what the failure was without 
> printing the errno.


David:  


I'm going to leave this printk as is.  I want to keep it, because the
global hstate function issues a similar printk [w/o] error code when it
can't add a global hstate.  The original authors considered this
necessary/useful, so I'm following suit.  The -ENOMEM error is returned
by hugetlb_sysfs_add_hstate() when kobject_create_and_add() fails.  If the
kobject creation can fail for reasons other than ENOMEM, then showing
ENOMEM in the message will sometimes be bogus.  If it can only fail due
to lack of memory, then we are adding no additional info.

more below

> 
> > > 
> > > > +			hugetlb_unregister_node(node);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static void hugetlb_register_all_nodes(void)
> > > > +{
> > > > +	int nid;
> > > > +
> > > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > 
> > > Don't you want to do this for all nodes in N_HIGH_MEMORY?  I don't think 
> > > we should be adding attributes for memoryless nodes.
> > 
> > 
> > Well, I wondered about that.  The persistent huge page allocation code
> > is careful to skip over nodes where it can't allocate a huge page,
> > whether or not the node has [any] memory.  So, it's safe to leave it
> > this way.
> 
> It's safe, but it seems inconsistent to allow hugepage attributes to 
> appear in /sys/devices/node/node* for nodes that have no memory.

OK.  I'm going to change this, and replace node_online_map with
node_states[N_HIGH_MEMORY] [or whatever] as a separate patch.  That way,
if someone complains that this doesn't work for memory hot plug, someone
who understands memory hotplug can fix it or we can drop that patch.

> 
> > And, I was worried about the interaction with memory hotplug,
> > as separate from node hotplug.  The current code handles node hot plug,
> > but I wasn't sure about memory hot-plug within a node.
> 
> If memory hotplug doesn't update node_states[N_HIGH_MEMORY] then there are
> plenty of other users that will currently fail as well.
> 
> > I.e., would the
> > node driver get called to register the attributes in this case?
> 
> Not without a MEM_ONLINE notifier.

That's what I thought.  I'm not up to speed on that area, and have no
bandwidth or interest to go there right now.  If you think that's a
show stopper for per node attributes, we can defer this part of the
series.  But, I'd prefer to stick with current "no-op" attributes for
memoryless nodes over that alternative.

> 
> > Maybe
> > that case doesn't exist, so I don't have to worry about it.   I think
> > this is somewhat similar to the top cpuset mems_allowed being set to all
> > possible to cover any subsequently added nodes/memory.
> > 
> 
> That's because the page allocator's zonelists won't try to allocate from a 
> memoryless node and the only hook into the cpuset code in that path is to 
> check whether a nid is set in cpuset_current_mems_allowed.  It's quite 
> different from providing per-node allocation and freeing mechanisms for 
> pages on nodes without memory like this approach.

OK, we have different perspectives on this.  I'm not at all offended by
the no-op attributes.  If you're worried about safety, I can check
explicitly in the attribute handlers and bail out early for memoryless
nodes.  Then, should someone hot add memory to the node, it will start
attempting to allocate huge pages when requested.
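
FWIW, memoryless nodes are easy to spot from userspace, too [a sketch
only; the kernel side check would live in the attribute handlers]:

	# a memoryless node reports "Node <N> MemTotal: 0 kB" here
	grep MemTotal /sys/devices/system/node/node*/meminfo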

> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-03 20:42   ` Randy Dunlap
@ 2009-09-04 15:23     ` Lee Schermerhorn
  0 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-04 15:23 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-09-03 at 13:42 -0700, Randy Dunlap wrote:
> On Fri, 28 Aug 2009 12:03:51 -0400 Lee Schermerhorn wrote:
> 
> (Thanks for cc:, David.)

Randy:  thanks for the review.  I'll add you to the cc list for
reposting of the series.  I'll make all of the changes you suggest,
except those that you seemed to concede might not be required:

1) surplus vs overcommitted.  The currently exported user space
interface uses both "overcommit" for specifying the limit and "surplus"
for displaying the number of overcommitted pages in use.  I agree that
it's somewhat of a misuse of "surplus" as the count actually indicates
"deficit spending" of huge page resources.

2)  the s/cpu/CPU/.  As you say, that cat is on the run.

<snip>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-03 21:25       ` David Rientjes
@ 2009-09-08 10:44           ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-08 10:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, eric.whitney,
	Randy Dunlap

On Thu, Sep 03, 2009 at 02:25:56PM -0700, David Rientjes wrote:
> On Thu, 3 Sep 2009, Lee Schermerhorn wrote:
> 
> > > > @@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
> > > >  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
> > > >  in the kernel.
> > > >  
> > > > -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> > > > -pages in the kernel.  Super user can dynamically request more (or free some
> > > > -pre-configured) huge pages.
> > > > -The allocation (or deallocation) of hugetlb pages is possible only if there are
> > > > -enough physically contiguous free pages in system (freeing of huge pages is
> > > > -possible only if there are enough hugetlb pages free that can be transferred
> > > > -back to regular memory pool).
> > > > -
> > > > -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> > > > -be used for other purposes.
> > > > -
> > > > -Once the kernel with Hugetlb page support is built and running, a user can
> > > > -use either the mmap system call or shared memory system calls to start using
> > > > -the huge pages.  It is required that the system administrator preallocate
> > > > -enough memory for huge page purposes.
> > > > -
> > > > -The administrator can preallocate huge pages on the kernel boot command line by
> > > > -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> > > > -requested.  This is the most reliable method for preallocating huge pages as
> > > > -memory has not yet become fragmented.
> > > > +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> > > > +allocated in the kernel's huge page pool.  These are called "persistent"
> > > > +huge pages.  A user with root privileges can dynamically allocate more or
> > > > +free some persistent huge pages by increasing or decreasing the value of
> > > > +'nr_hugepages'.
> > > > +
> > > 
> > > So they're not necessarily "preallocated" then if they're already in use.
> > 
> > I don't see what in the text you're referring to:  "preallocated" vs
> > "already in use" ???
> > 
> 
> Your new line, "/proc/sys/vm/nr_hugepages indicates the current number of 
> huge pages preallocated in the kernel's huge page pool" doesn't seem 
> correct since pages are not "pre"-allocated if they are used by an 
> application.  Preallocation is only when pages are allocated for a 
> performance optimization in a later hotpath (such as in a slab allocator) 
> or when the allocation cannot be done later in a non-blocking context.  If 
> you were to remove "pre" from that line it would be clear.
> 
> > > Not sure if you need to spell out that they're called "huge page allowed 
> > > nodes," isn't that an implementation detail?  The way Paul Jackson used to 
> > > describe nodes_allowed is "set of allowable nodes," and I can't think of a 
> > > better phrase.  That's also how the cpuset documentation describes them.
> > 
> > I wanted to refer to "huge pages allowed nodes" to differentiate from,
> > e.g., cpusets' "mems_allowed"--i.e., I wanted the "huge pages" qualifier.
> > I suppose I could introduce the phrase you suggest:  "set of allowable
> > nodes" and emphasize that in this doc, it only refers to nodes from
> > which persistent huge pages will be allocated.
> > 
> 
> It's a different story if you want to use the phrase "allowed nodes" 
> throughout this document to mean "the set of allowed nodes from which to 
> allocate hugepages depending on the allocating task's mempolicy," but I 
> didn't see any future reference to that phrase in your changes anyway.
> 
> > I understand.  However, I do think it's useful to support both a mask
> > [and Mel prefers it be based on mempolicy] and per node attributes.  On
> > some of our platforms, we do want explicit control over the placement of
> > huge pages--e.g., for a data base shared area or such.  So, we can say,
> > "I need <N> huge pages, and I want them on nodes 1, 3, 4 and 5", and
> > then, assuming we start with no huge pages allocated [free them all if
> > this is not the case]:
> > 
> > 	numactl -m 1,3-5 hugeadm --pool-pages-min 2M:<N>
> > 
> > Later, if I decide that maybe I want to adjust the number on node 1, I
> > can:
> > 
> > 	numactl -m 1 --pool-pages-min 2M:{+|-}<count>
> > 
> > or:
> > 
> > 	echo <new-value> >/sys/devices/system/node/node1/hugepages/hugepages-2048KB/nr_hugepages
> > 
> > [Of course, I'd probably do this in a script to avoid all that typing :)]
> > 
> 
> Yes, but the caveat I'm pointing out (and is really clearly described in 
> your documentation changes here) is that existing applications, shell 
> scripts, job schedulers, whatever, which currently free all system 
> hugepages (or do so at a consistent interval down to the surplus 
> value to reclaim memory) will now leak disjoint pages since the freeing is 
> now governed by its mempolicy. 

While this is a possibility, it makes little sense to assume that behaviour. To
be really bitten by the change, the policy used to allocate huge pages needs
to be different than the policy used to free them. This would be a bit
screwy as it would imply the job scheduler allocated pages that would
then be unusable by the job if policies were being obeyed, which makes
very little sense.

> If the benefits of doing this 
> significantly outweigh that potential for userspace breakage, I have no 
> objection to it.  I just can't say for certain that it is.
> 

An application depending on memory policies to be ignored is pretty broken
to begin with.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-08 10:44           ` Mel Gorman
@ 2009-09-08 19:51           ` David Rientjes
  2009-09-08 20:04               ` Mel Gorman
  -1 siblings, 1 reply; 81+ messages in thread
From: David Rientjes @ 2009-09-08 19:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, 8 Sep 2009, Mel Gorman wrote:

> > Yes, but the caveat I'm pointing out (and is really clearly described in 
> > your documentation changes here) is that existing applications, shell 
> > scripts, job schedulers, whatever, which currently free all system 
> > hugepages (or do so at a consistent interval down to the surplus 
> > value to reclaim memory) will now leak disjoint pages since the freeing is 
> > now governed by its mempolicy. 
> 
> While this is a possibility, it makes little sense to assume that behaviour. To
> be really bitten by the change, the policy used to allocate huge pages needs
> to be different than the policy used to free them. This would be a bit
> screwy as it would imply the job scheduler allocated pages that would
> then be unusable by the job if policies were being obeyed, which makes
> very little sense.
> 

Au contraire, the hugepages= kernel parameter is not restricted to any 
mempolicy.

> > If the benefits of doing this 
> > significantly outweigh that potential for userspace breakage, I have no 
> > objection to it.  I just can't say for certain that it is.
> > 
> 
> An application depending on memory policies to be ignored is pretty broken
> to begin with.
> 

Theoretically, yes, but not in practice.  /proc/sys/vm/nr_hugepages has 
always allocated and freed without regard to current's mempolicy prior to 
this patchset and it wouldn't be "broken" for an application to assume 
that it will continue to do so.  More broken is assuming that such an 
application should have been written to change its mempolicy to include 
all nodes that have hugepages prior to freeing because someday the kernel 
would change to do mempolicy-restricted hugepage freeing.
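
The assumption such an application makes is simply the following (a
sketch of the historical behavior this thread describes):

	# historically this freed the entire pool, regardless of which
	# nodes the pages sat on or of the caller's mempolicy
	echo 0 > /proc/sys/vm/nr_hugepages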

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-08 19:51           ` David Rientjes
@ 2009-09-08 20:04               ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-08 20:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, Sep 08, 2009 at 12:51:48PM -0700, David Rientjes wrote:
> On Tue, 8 Sep 2009, Mel Gorman wrote:
> 
> > > Yes, but the caveat I'm pointing out (and is really clearly described in 
> > > your documentation changes here) is that existing applications, shell 
> > > scripts, job schedulers, whatever, which currently free all system 
> > > hugepages (or do so at a consistent interval down to the surplus 
> > > value to reclaim memory) will now leak disjoint pages since the freeing is 
> > > now governed by its mempolicy. 
> > 
> > While this is a possibility, it makes little sense to assume that behaviour. To
> > be really bitten by the change, the policy used to allocate huge pages needs
> > to be different than the policy used to free them. This would be a bit
> > screwy as it would imply the job scheduler allocated pages that would
> > then be unusable by the job if policies were being obeyed, which makes
> > very little sense.
> > 
> 
> Au contraire, the hugepages= kernel parameter is not restricted to any 
> mempolicy.
> 

I'm not seeing how it would be considered symmetric to compare allocation
via a boot-time parameter with freeing happening at run-time within a
mempolicy. It's more plausible to me that such a scenario would have the
freeing thread either running with no policy or with the ability to run
with no policy applied.

> > > If the benefits of doing this 
> > > significantly outweigh that potential for userspace breakage, I have no 
> > > objection to it.  I just can't say for certain that it is.
> > > 
> > 
> > An application depending on memory policies to be ignored is pretty broken
> > to begin with.
> > 
> 
> Theoretically, yes, but not in practice.  /proc/sys/vm/nr_hugepages has 
> always allocated and freed without regard to current's mempolicy prior to 
> this patchset and it wouldn't be "broken" for an application to assume 
> that it will continue to do so. 

I don't think we're going to agree on this one. I find it very unlikely
that the process doing the allocation and freeing is going to have
different memory policies.

> More broken is assuming that such an 
> application should have been written to change its mempolicy to include 
> all nodes that have hugepages prior to freeing because someday the kernel 
> would change to do mempolicy-restricted hugepage freeing.
> 

It wouldn't have to be rewritten. At very worst, rearranged at startup
to have the same policy when allocating and freeing.
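
For illustration, a symmetric sequence under this patchset might look
like the following (node numbers and count are hypothetical):

	# allocate 128 persistent huge pages, constrained to nodes 1-2
	numactl -m 1,2 sh -c 'echo 128 > /proc/sys/vm/nr_hugepages'

	# later, free under the same policy so the same nodes are visited
	numactl -m 1,2 sh -c 'echo 0 > /proc/sys/vm/nr_hugepages'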

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-08 20:04               ` Mel Gorman
@ 2009-09-08 20:18               ` David Rientjes
  2009-09-08 21:41                   ` Mel Gorman
  -1 siblings, 1 reply; 81+ messages in thread
From: David Rientjes @ 2009-09-08 20:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, 8 Sep 2009, Mel Gorman wrote:

> > Au contraire, the hugepages= kernel parameter is not restricted to any 
> > mempolicy.
> > 
> 
> I'm not seeing how it would be considered symmetric to compare allocation
> via a boot-time parameter with freeing happening at run-time within a
> mempolicy. It's more plausible to me that such a scenario would have the
> freeing thread either running with no policy or with the ability to run
> with no policy applied.
> 

Imagine a cluster of machines that are all treated equally to serve a 
variety of different production jobs.  One of those production jobs 
requires a very high percentage of hugepages.  In fact, its performance 
gain is directly proportional to the number of hugepages allocated.

It is quite plausible for all machines to be booted with hugepages= to 
achieve the maximum number of hugepages that those machines may support.  
Depending on what jobs they will serve, however, those hugepages may 
immediately be freed (or a subset, depending on other smaller jobs that 
may want them).  If the job scheduler is bound to a mempolicy which does 
not include all nodes with memory, those hugepages are now leaked.  That 
was not the behavior over the past three or four years until this 
patchset.
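
As a sketch of the sequence I mean (node numbers hypothetical):

	# every machine boots with hugepages=<large number>; the job
	# scheduler, itself confined to node 0, then tries to free the lot:
	numactl -m 0 sh -c 'echo 0 > /proc/sys/vm/nr_hugepages'
	# with mempolicy-based freeing, the pages on every other node stay
	# in the pool -- leaked, from the scheduler's point of view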

That example is not dealing in hypotheticals or assumptions on how people 
use hugepages, it's based on reality.  As I said previously, I don't 
necessarily have an objection to that if it can be shown that the 
advantages significantly outweigh the disadvantages.  I'm not sure I see 
the advantage in being implicit vs. explicit, however.  Mempolicy 
allocation and freeing is now _implicit_ because it's restricted to 
current's mempolicy when it wasn't before, yet node-targeted hugepage 
allocation and freeing is _explicit_ because it's a new interface and on 
the same granularity.

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-08 20:18               ` David Rientjes
@ 2009-09-08 21:41                   ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-08 21:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, Sep 08, 2009 at 01:18:01PM -0700, David Rientjes wrote:
> On Tue, 8 Sep 2009, Mel Gorman wrote:
> 
> > > Au contraire, the hugepages= kernel parameter is not restricted to any 
> > > mempolicy.
> > > 
> > 
> > I'm not seeing how it would be considered symmetric to compare allocation
> > via a boot-time parameter with freeing happening at run-time within a
> > mempolicy. It's more plausible to me that such a scenario would have the
> > freeing thread either running with no policy or with the ability to run
> > with no policy applied.
> > 
> 
> Imagine a cluster of machines that are all treated equally to serve a 
> variety of different production jobs.  One of those production jobs 
> requires a very high percentage of hugepages.  In fact, its performance 
> gain is directly proportional to the number of hugepages allocated.
> 
> It is quite plausible for all machines to be booted with hugepages= to 
> achieve the maximum number of hugepages that those machines may support.  
> Depending on what jobs they will serve, however, those hugepages may 
> immediately be freed (or a subset, depending on other smaller jobs that 
> may want them).  If the job scheduler is bound to a mempolicy which does 
> not include all nodes with memory, those hugepages are now leaked. 

Why is a job scheduler that is expecting to affect memory on a global
basis running inside a mempolicy that restricts it to a subset of nodes?
It seems inconsistent that an isolated job starting could affect the global
state, potentially affecting other jobs starting up.

In addition, if it is the case that the job's performance is directly
proportional to the number of hugepages it gets access to, why is it starting
up with access to only a subset of the available hugepages? Why is it not
being set up to be the first job to start on a freshly booted machine,
starting on the subset of nodes allowed and requesting the maximum number
of hugepages it needs such that it achieves maximum performance? With the
memory policy approach, it's very straightforward to do this because all
it has to do is write to nr_hugepages when it starts up.

> That 
> was not the behavior over the past three or four years until this 
> patchset.
> 

While this is true, I know people have also been bitten by the expectation
that writing to nr_hugepages would obey a memory policy and were surprised
when it didn't happen and sent me whinging emails. It also appeared obvious
to me that it's how the interface should behave even if it wasn't doing it
in practice. Once nr_hugepages obeys memory policies, it's fairly convenient
to size the number of pages on a subset of nodes using numactl - a tool that
people would generally expect to be used when operating on nodes. Hence the
example usage being

numactl -m x,y,z hugeadm --pool-pages-min $PAGESIZE:$NUMPAGES
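
The same constraint applies to a plain write of the sysctl without
hugeadm, e.g.

	numactl -m x,y,z sh -c "echo $NUMPAGES > /proc/sys/vm/nr_hugepages"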

> That example is not dealing in hypotheticals or assumptions on how people 
> use hugepages, it's based on reality.  As I said previously, I don't 
> necessarily have an objection to that if it can be shown that the 
> advantages significantly outweigh the disadvantages.  I'm not sure I see 
> the advantage in being implicit vs. explicit, however. 

The advantage is that with memory policies on nr_hugepages, it's very
convenient to allocate pages within a subset of nodes without worrying about
where exactly those huge pages are being allocated from. It will allocate
them on a round-robin basis, allocating more pages on one node over another
if fragmentation requires it, rather than shifting the burden to a userspace
application figuring out what nodes might succeed an allocation or shifting
the burden onto the system administrator. It's likely that writing to the
global nr_hugepages within a mempolicy will end up with a more sensible
result than a userspace application dealing with the individual node-specific
nr_hugepages files.

To do the same with the explicit interface, a userspace application
or administrator would have to keep reading the existing nr_hugepages,
writing existing_nr_hugepages+1 to each node in the allowed set, re-reading
to check for allocation failure and round-robining by hand.  This seems
awkward-for-the-sake-of-being-awkward when the kernel is already perfectly
aware of how to round-robin allocate the requested number of pages across
the nodes, allocating more on one node if necessary.
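
Roughly the following sketch, with the 2M page size, node list and
target count assumed, and error handling omitted:

	base=/sys/devices/system/node
	allocated=0; wanted=32; allowed_nodes="1 3 4 5"
	while [ $allocated -lt $wanted ]; do
		progress=0
		for n in $allowed_nodes; do
			f=$base/node$n/hugepages/hugepages-2048kB/nr_hugepages
			cur=$(cat $f)
			echo $((cur + 1)) > $f
			if [ $(cat $f) -gt $cur ]; then
				allocated=$((allocated + 1)); progress=1
			fi
			[ $allocated -ge $wanted ] && break
		done
		# stop when no node in the set could allocate another page
		[ $progress -eq 0 ] && break
	done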

> Mempolicy 
> allocation and freeing is now _implicit_ because it's restricted to 
> current's mempolicy when it wasn't before, yet node-targeted hugepage 
> allocation and freeing is _explicit_ because it's a new interface and on 
> the same granularity.
> 

Arguably, because the application was restricted by a memory policy, it
should not be able to operate outside of that policy and should be forbidden
from writing to the per-node nr_hugepages outside the allowed set.  However,
that would appear awkward for the sake of it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-08 21:41                   ` Mel Gorman
@ 2009-09-08 22:54                   ` David Rientjes
  2009-09-09  8:16                       ` Mel Gorman
  -1 siblings, 1 reply; 81+ messages in thread
From: David Rientjes @ 2009-09-08 22:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, 8 Sep 2009, Mel Gorman wrote:

> Why is a job scheduler that is expecting to affect memory on a global
> basis running inside a mempolicy that restricts it to a subset of nodes?

Because hugepage allocation and freeing has always been on a global basis, 
there was no previous restriction.

> In addition, if it is the case that the job's performance is directly
> proportional to the number of hugepages it gets access to, why is it starting
> up with access to only a subset of the available hugepages?

It's not, that particular job is allocated the entire machine other than 
the job scheduler.  The point is that the machine pool treats each machine 
equally so while they are all booted with hugepages=<large number>, 
machines that don't serve this application immediately free them.  If 
hugepages cannot be dynamically allocated up to a certain threshold that 
the application requires, a reboot is necessary if no other machines are 
available.

Since the only way to achieve the absolute maximum number of hugepages 
possible on a machine is through the command line, it's completely 
reasonable to use it on every boot and then subsequently free them when 
they're unnecessary.

> Why is it not
> being set up to be the first job to start on a freshly booted machine,
> starting on the subset of nodes allowed and requesting the maximum number
> of hugepages it needs such that it achieves maximum performance? With the
> memory policy approach, it's very straightforward to do this because all
> it has to do is write to nr_hugepages when it starts up.
> 

The job scheduler will always be the first job to start on the machine and 
it may have a mempolicy of its own.  When it attempts to free all 
hugepages allocated by the command line, it will then leak pages because 
of these changes.

Arguing that applications should always dynamically allocate their 
hugepages on their own subset of nodes when they are started is wishful 
thinking: it's much easier to allocate as many as possible on the command 
line and then free unnecessary hugepages than allocate on the mempolicy's 
nodes for the maximal number of hugepages.

> > That 
> > was not the behavior over the past three or four years until this 
> > patchset.
> > 
> 
> While this is true, I know people have also been bitten by the expectation
> that writing to nr_hugepages would obey a memory policy and were surprised
> when it didn't happen and sent me whinging emails.

Ok, but I doubt the inverse is true: people probably haven't emailed you 
letting you know that they've coded their application for hugepage 
allocations based on how the kernel has implemented it for years, so 
there's no context.  That's the population that I'm worried about (and is 
also solved by node-targeted hugepage allocation, btw).

 [ Such users who have emailed you could always use cpusets for the
   desired effect, it's not like there isn't a solution. ]

> It also appeared obvious
> to me that it's how the interface should behave even if it wasn't doing it
> in practice. Once nr_hugepages obeys memory policies, it's fairly convenient
> to size the number of pages on a subset of nodes using numactl - a tool that
> people would generally expect to be used when operating on nodes. Hence the
> example usage being
> 
> numactl -m x,y,z hugeadm --pool-pages-min $PAGESIZE:$NUMPAGES
> 

We need a new mempolicy flag, then, such as MPOL_F_HUGEPAGES to constrain 
hugepage allocation and freeing via the global tunable to such a 
mempolicy.

> The advantage is that with memory policies on nr_hugepages, it's very
> convenient to allocate pages within a subset of nodes without worrying about
> where exactly those huge pages are being allocated from. It will allocate
> them on a round-robin basis, allocating more pages on one node over another
> if fragmentation requires it, rather than shifting the burden to a userspace
> application figuring out what nodes might succeed an allocation or shifting
> the burden onto the system administrator.

That "burden" is usually for good reason: if my MPOL_INTERLEAVE policy 
gets hugepages that are relatively unbalanced across a set of nodes that I 
arbitrarily picked, my interleave isn't going to be nearly as optimized
as what userspace can achieve with the node-targeted approach:
allocate [desired nr of hugepages] / [nr nodes in policy] hugepages via 
/sys/devices/system/node/node*/nr_hugepages, and construct a policy out of 
the nodes that give a true interleave.  That's not a trivial performance 
gain and can rarely be dismissed by simply picking an arbitrary set.
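
Concretely, something like this sketch (counts, node list and ./app are
hypothetical):

	# 64 pages interleaved over nodes 0-3
	want=64; nodes="0 1 2 3"; per_node=$((want / 4))
	for n in $nodes; do
		echo $per_node > /sys/devices/system/node/node$n/hugepages/hugepages-2048kB/nr_hugepages
	done
	# build the interleave nodemask from the nodes that actually
	# delivered their full share
	good=""
	for n in $nodes; do
		got=$(cat /sys/devices/system/node/node$n/hugepages/hugepages-2048kB/nr_hugepages)
		[ $got -ge $per_node ] && good="$good,$n"
	done
	numactl --interleave=${good#,} ./app	# ./app is a placeholder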

> It's likely that writing to the
> global nr_hugepages within a mempolicy will end up with a more sensible
> result than a userspace application dealing with the individual node-specific
> nr_hugepages files.
> 

Disagree for the interleave example above without complex userspace logic 
to determine how successful hugepage allocation will be based on 
fragmentation.

> To do the same with the explicit interface, a userspace application
> or administrator would have to keep reading the existing nr_hugepages,
> writing existing_nr_hugepages+1 to each node in the allowed set, re-reading
> to check for allocation failure and round-robining by hand.  This seems
> awkward-for-the-sake-of-being-awkward when the kernel is already perfectly
> aware of how to round-robin allocate the requested number of pages across
> the nodes, allocating more on one node if necessary.
> 

No need for an iteration, simply allocate the ratio I specified above on 
each node and then construct a mempolicy from those nodes based on actual 
results instead of arbitrarily.

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-08 22:54                   ` David Rientjes
@ 2009-09-09  8:16                       ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-09  8:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, Sep 08, 2009 at 03:54:40PM -0700, David Rientjes wrote:
> On Tue, 8 Sep 2009, Mel Gorman wrote:
> 
> > Why is a job scheduler that is expecting to affect memory on a global
> > basis running inside a mempolicy that restricts it to a subset of nodes?
> 
> Because hugepage allocation and freeing has always been on a global basis, 
> there was no previous restriction.
> 

And to beat a dead horse, it does make sense that an application
allocating hugepages obey memory policies. It does with dynamic hugepage
pool resizing, for example. It should have been done years ago and
unfortunately wasn't, but it's not the first time that the behaviour of
hugepages has differed from the core VM.

> > In addition, if it is the case that the job's performance is directly
> > proportional to the number of hugepages it gets access to, why is it starting
> > up with access to only a subset of the available hugepages?
> 
> It's not, that particular job is allocated the entire machine other than 
> the job scheduler. The point is that the machine pool treats each machine 
> equally so while they are all booted with hugepages=<large number>, 
> machines that don't serve this application immediately free them.  If 
> hugepages cannot be dynamically allocated up to a certain threshold that 
> the application requires, a reboot is necessary if no other machines are 
> available.
> 
> Since the only way to achieve the absolute maximum number of hugepages 
> possible on a machine is through the command line, it's completely 
> reasonable to use it on every boot and then subsequently free them when 
> they're unnecessary.
> 

But it's less reasonable that an application within a memory policy be able
to affect memory on a global basis. Why not let it break cpusets or
something else as well?

> > Why is it not
> > being set up to be the first job to start on a freshly booted machine,
> > starting on the subset of nodes allowed and requesting the maximum number
> > of hugepages it needs such that it achieves maximum performance? With the
> > memory policy approach, it's very straightforward to do this because all
> > it has to do is write to nr_hugepages when it starts up.
> > 
> 
> The job scheduler will always be the first job to start on the machine and 
> it may have a mempolicy of its own.  When it attempts to free all 
> hugepages allocated by the command line, it will then leak pages because 
> of these changes.
> 
> Arguing that applications should always dynamically allocate their 
> hugepages on their own subset of nodes when they are started is wishful 
> thinking: it's much easier to allocate as many as possible on the command 
> line and then free unnecessary hugepages than allocate on the mempolicy's 
> nodes for the maximal number of hugepages.
> 

Only in your particular case, where you're willing to reboot the machine to
satisfy a job's hugepage requirement. This is not always the situation. On
shared machines, there can be many jobs running, each with different hugepage
requirements. The objective of things like anti-fragmentation, lumpy reclaim
and the like was to allow these sorts of jobs to allocate the pages they
need at run-time. In the event these jobs are running on a subset of nodes,
it's of benefit to have nr_hugepages obey memory policies or else userspace
or the administrator has to try a number of different tricks to get the
hugepages they need on the nodes they want.

> > > That 
> > > was not the behavior over the past three or four years until this 
> > > patchset.
> > > 
> > 
> > While this is true, I know people have also been bitten by the expectation
> > that writing to nr_hugepages would obey a memory policy and were surprised
> > when it didn't happen and sent me whinging emails.
> 
> Ok, but I doubt the inverse is true: people probably haven't emailed you 
> letting you know that they've coded their application for hugepage 
> allocations based on how the kernel has implemented it for years, so 
> there's no context. 

They didn't code their application specifically for this case. What happened
is that their jobs needed to run on a subset of nodes and they wanted the
hugepages available only on those nodes. They wrote the value to nr_hugepages
under a memory policy and were surprised when that didn't work. cpusets were
not an obvious choice.

> That's the population that I'm worried about (and is 
> also solved by node-targeted hugepage allocation, btw).
> 

I disagree, because it pushes the burden of interleaving onto the userspace
application or the administrator, something the kernel is well able to
deal with.

>  [ Such users who have emailed you could always use cpusets for the
>    desired effect, it's not like there isn't a solution. ]
>  

Which is very convoluted. numactl is the expected administrative interface
for restricting allocations to a set of nodes, not cpusets.

> > It also appeared obvious
> > to me that it's how the interface should behave even if it wasn't doing it
> > in practice. Once nr_hugepages obeys memory policies, it's fairly convenient
> > to size the number of pages on a subset of nodes using numactl - a tool that
> > people would generally expect to be used when operating on nodes. Hence the
> > example usage being
> > 
> > numactl -m x,y,z hugeadm --pool-pages-min $PAGESIZE:$NUMPAGES
> > 
> 
> We need a new mempolicy flag, then, such as MPOL_F_HUGEPAGES to constrain 
> hugepage allocation and freeing via the global tunable to such a 
> mempolicy.
> 

This would be somewhat inconsistent. When dynamic hugepage pool resizing
is enabled, the application obeys the memory policy that is in place.
Your suggested flag would only apply when nr_hugepages is being
written to.

It would also appear as duplicated and redundant functionality to numactl
because it would have --interleave meaning interleaving and --hugepages
meaning interleave but only when nr_hugepages is being written to.

> > The advantage is that with memory policies on nr_hugepages, it's very
> > convenient to allocate pages within a subset of nodes without worrying about
> > where exactly those huge pages are being allocated from. It will allocate
> > them on a round-robin basis, allocating more pages on one node over another
> > if fragmentation requires it, rather than shifting the burden to a userspace
> > application figuring out what nodes might succeed an allocation or shifting
> > the burden onto the system administrator.
> 
> That "burden" is usually for good reason: if my MPOL_INTERLEAVE policy 
> gets hugepages that are relatively unbalanced across a set of nodes that I 
> arbitrarily picked, my interleave isn't going to be nearly as optimized
> as what userspace can achieve with the node-targeted approach:
> allocate [desired nr of hugepages] / [nr nodes in policy] hugepages via 
> /sys/devices/system/node/node*/nr_hugepages, and construct a policy out of 
> the nodes that give a true interleave.

Except that knowledge and awareness of having to do this is pushed out
to userspace and the system administrator, when again it's something
the kernel can trivially do on their behalf.

> That's not a trivial performance 
> gain and can rarely be dismissed by simply picking an arbitrary set.
> 
> > It's likely that writing to the
> > global nr_hugepages within a mempolicy will end up with a more sensible
> > result than a userspace application dealing with the individual node-specific
> > nr_hugepages files.
> > 
> 
> Disagree for the interleave example above without complex userspace logic 
> to determine how successful hugepage allocation will be based on 
> fragmentation.
> 

It can read the value of nr_hugepages back to see whether the total
allocation was successful. hugeadm does this, for example, and warns when the
desired number of pages was not allocated (or not freed, as the case may
be). It would not detect imbalanced allocations without taking further
steps, but that would depend on whether being evenly interleaved was
more important than having the maximum number of hugepages.
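
That check is a one-liner of the following sort (sketch, target count
hypothetical):

	wanted=512
	echo $wanted > /proc/sys/vm/nr_hugepages
	got=$(cat /proc/sys/vm/nr_hugepages)
	[ $got -lt $wanted ] && echo "only $got of $wanted huge pages allocated" >&2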

> > To do the same with the explicit interface, a userspace application
> > or administrator would have to keep reading the existing nr_hugepages,
> > writing existing_nr_hugepages+1 to each node in the allowed set, re-reading
> > to check for allocation failure and round-robining by hand.  This seems
> > awkward-for-the-sake-of-being-awkward when the kernel is already perfectly
> > aware of how to round-robin allocate the requested number of pages across
> > the nodes, allocating more on one node if necessary.
> > 
> 
> No need for an iteration, simply allocate the ratio I specified above on 
> each node and then construct a mempolicy from those nodes based on actual 
> results instead of arbitrarily.
> 

They would still need to read back the values, determine if the full
allocation was successful and, if not, figure out where it failed and
recalculate. 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
@ 2009-09-09  8:16                       ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-09  8:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Tue, Sep 08, 2009 at 03:54:40PM -0700, David Rientjes wrote:
> On Tue, 8 Sep 2009, Mel Gorman wrote:
> 
> > Why is a job scheduler that is expecting to affect memory on a global
> > basis running inside a mempolicy that restricts it to a subset of nodes?
> 
> Because hugepage allocation and freeing has always been on a global basis, 
> there was no previous restriction.
> 

And to beat a dead horse, it does make sense that an application
allocating hugepages obey memory policies. It does with dynamic hugepage
resizing for example. It should have been done years ago and
unfortunately wasn't but it's not the first time that the behaviour of
hugepages differed from the core VM.

> > In addition, if it is the case that the jobs performance is directly
> > proportional to the number of hugepages it gets access to, why is it starting
> > up with access to only a subset of the available hugepages?
> 
> It's not, that particular job is allocated the entire machine other than 
> the job scheduler. The point is that the machine pool treats each machine 
> equally so while they are all booted with hugepages=<large number>, 
> machines that don't serve this application immediately free them.  If 
> hugepages cannot be dynamically allocated up to a certain threshold that 
> the application requires, a reboot is necessary if no other machines are 
> available.
> 
> Since the only way to achieve the absolute maximum number of hugepages 
> possible on a machine is through the command line, it's completely 
> reasonable to use it on every boot and then subsequently free them when 
> they're unnecessary.
> 

But less reasonable that an application within a memory policy be able
to affect memory on a global basis. Why not let it break cpusets or
something else as well?

> > Why is it not
> > being setup to being the first job to start on a freshly booting machine,
> > starting on the subset of nodes allowed and requesting the maximum number
> > of hugepages it needs such that it achieves maximum performance? With the
> > memory policy approach, it's very straight-forward to do this because all
> > it has to do is write to nr_hugepages when it starts-up.
> > 
> 
> The job scheduler will always be the first job to start on the machine and 
> it may have a mempolicy of its own.  When it attempts to free all 
> hugepages allocated by the command line, it will then leak pages because 
> of these changes.
> 
> Arguing that applications should always dynamically allocate their 
> hugepages on their own subset of nodes when they are started is wishful 
> thinking: it's much easier to allocate as many as possible on the command 
> line and then free unnecessary hugepages than allocate on the mempolicy's 
> nodes for the maximal number of hugepages.
> 

Only in your particular case where you're willing to reboot the machine to
satisfy a jobs hugepage requirement. This is not always the situation. On
shared-machines, there can be many jobs running, each with different hugepage
requirements. The objective of things like anti-fragmentation, lumpy reclaim
and the like was to allow these sort of jobs to allocate the pages they
need at run-time. In the event these jobs are running on a subset of nodes,
it's of benefit to have nr_hugepages obey memory policies or else userspace
or the administrator has to try a number of different tricks to get the
hugepages they need on the nodes they want.

> > > That 
> > > was not the behavior over the past three or four years until this 
> > > patchset.
> > > 
> > 
> > While this is true, I know people have also been bitten by the expectation
> > that writing to nr_hugepages would obey a memory policy and were surprised
> > when it didn't happen and sent me whinging emails.
> 
> Ok, but I doubt the inverse is true: people probably haven't emailed you 
> letting you know that they've coded their application for hugepage 
> allocations based on how the kernel has implemented it for years, so 
> there's no context. 

They didn't code their application specifically to this case. What happened
is that their jobs needed to run on a subset of nodes and they wanted the
hugepages only available on those nodes. They wrote the value under a memory
policy to nr_hugepages and were suprised when that didn't work. cpusets were
not an obvious choice.

> That's the population that I'm worried about (and is 
> also solved by node-targeted hugepage allocation, btw).
> 

I disagree because it pushes the burden of interleaving to the userspace
application or the administrator, something the kernel can and is able
to deal with.

>  [ Such users who have emailed you could always use cpusets for the
>    desired effect, it's not like there isn't a solution. ]
>  

Which is very convulated. numactl is the expected administrative interface
to restrict allocations on a set of nodes, not cpusets.

> > It also appeared obvious
> > to me that it's how the interface should behave even if it wasn't doing it
> > in practice. Once nr_hugepages obeys memory policies, it's fairly convenient
> > to size the number of pages on a subset of nodes using numactl - a tool that
> > people would generally expect to be used when operating on nodes. Hence the
> > example usage being
> > 
> > numactl -m x,y,z hugeadm --pool-pages-min $PAGESIZE:$NUMPAGES
> > 
> 
> We need a new mempolicy flag, then, such as MPOL_F_HUGEPAGES to constrain 
> hugepage allocation and freeing via the global tunable to such a 
> mempolicy.
> 

This would be somewhat inconsistent. When dynamic hugepage pool resizing
is enabled, the application obeys the memory policy that is in place.
Your suggested policy would only apply when nr_hugepages is being
written to.

It would also appear as duplicated and redundant functionality to numactl
because it would have --interleave meaning interleaving and --hugepages
meaning interleave but only when nr_hugepages is being written to.

> > The advantage is that with memory policies on nr_hugepages, it's very
> > convenient to allocate pages within a subset of nodes without worrying about
> > where exactly those huge pages are being allocated from. It will allocate
> > them on a round-robin basis allocating more pages on one node over another
> > if fragmentation requires it rather than shifting the burden to a userspace
> > application figuring out what nodes might succeed an allocation or shifting
> > the burden onto the system administrator.
> 
> That "burden" is usually for good reason: if my MPOL_INTERLEAVE policy 
> gets hugepages that are relatively unbalanced across a set of nodes that I 
> arbitrarily picked, my interleave it's going to be nearly as optimized 
> compared to what userspace can allocate with the node-targeted approach: 
> allocate [desired nr of hugepages] / [nr nodes in policy] hugepages via 
> /sys/devices/system/node/node*/nr_hugepages, and construct a policy out of 
> the nodes that give a true interleave.

Except that knowledge and awareness of having to do this is pushed out
to userspace and the system administrator, when again it's something
the kernel can trivially do on their behalf.

> That's not a trivial performance 
> gain and can rarely be dismissed by simply picking an arbitrary set.
> 
> > It's likely that writing to the
> > global nr_hugepages within a mempolicy will end up with a more sensible
> > result than a userspace application dealing with the individual node-specific
> > nr_hugepages files.
> > 
> 
> Disagree for the interleave example above without complex userspace logic 
> to determine how successful hugepage allocation will be based on 
> fragmentation.
> 

It can read the value of nr_hugepages back to see whether the total number
of allocations was successful. hugeadm does this, for example, and warns when
the desired number of pages was not allocated (or not freed, as the case may
be). It would not detect imbalanced allocations across nodes without taking
further steps, but that only matters if being evenly interleaved is more
important than having the maximum number of hugepages.
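
The check involved is minimal, something along these lines (a sketch of
what hugeadm effectively does; the target count is illustrative):

	# request 128 huge pages, then verify how many were actually allocated
	echo 128 > /proc/sys/vm/nr_hugepages
	actual=$(cat /proc/sys/vm/nr_hugepages)
	if [ "$actual" -lt 128 ]; then
		echo "warning: only $actual of 128 huge pages allocated" >&2
	fi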

> > To do the same with the explicit interface, a userspace application
> > or administrator would have to keep reading the existing nr_hugepages,
> > writing existing_nr_hugepages+1 to each node in the allowed set, re-reading
> > to check for allocation failure and round-robining by hand.  This seems
> > awkward-for-the-sake-of-being-awkward when the kernel is already perfectly
> > aware of how to round-robin allocate the requested number of pages across
> > the allowed nodes, allocating more on one node if necessary.
> > 
> 
> No need for an iteration, simply allocate the ratio I specified above on 
> each node and then construct a mempolicy from those nodes based on actual 
> results instead of arbitrarily.
> 

They would still need to read back the values, determine whether the full
allocation was successful and, if not, figure out where it failed and
recalculate.
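
To make that awkwardness concrete, the by-hand round-robin described above
would look something like this (a sketch; the per-node path assumes this
series' sysfs layout, node set and target illustrative):

	# hand-rolled round-robin: bump each allowed node by one page at a time
	nodes="0 1 2"
	want=48 got=0
	while [ "$got" -lt "$want" ]; do
		progress=0
		for n in $nodes; do
			f=/sys/devices/system/node/node$n/hugepages/hugepages-2048kB/nr_hugepages
			cur=$(cat "$f")
			echo $((cur + 1)) > "$f"
			if [ "$(cat "$f")" -gt "$cur" ]; then
				got=$((got + 1))
				progress=1
			fi
			[ "$got" -ge "$want" ] && break
		done
		# give up once no node can allocate another page
		[ "$progress" -eq 0 ] && break
	done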

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-09  8:16                       ` Mel Gorman
  (?)
@ 2009-09-09 20:44                       ` David Rientjes
  2009-09-10 12:26                           ` Mel Gorman
  -1 siblings, 1 reply; 81+ messages in thread
From: David Rientjes @ 2009-09-09 20:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Wed, 9 Sep 2009, Mel Gorman wrote:

> And to beat a dead horse, it does make sense that an application
> allocating hugepages obey memory policies. It does with dynamic hugepage
> resizing for example. It should have been done years ago and
> unfortunately wasn't but it's not the first time that the behaviour of
> hugepages differed from the core VM.
> 

I agree completely, I'm certainly not defending the current implementation 
as a sound design, and I too would have preferred that it had done the 
same as Lee's patchset from the very beginning.  The issue I'm raising is 
that while we both agree the current behavior is suboptimal and confusing, 
it is the long-standing kernel behavior.  There are applications out there 
that are written to allocate and free hugepages, and changing the pool 
from which they can allocate or free could now be problematic.

I'm personally fine with the breakage since I'm aware of this discussion 
and can easily fix it in userspace.  I'm more concerned about others 
leaking hugepages or having their boot scripts break because they are 
allocating far fewer hugepages than before.  The documentation 
(Documentation/vm/hugetlbpage.txt) has always said 
/proc/sys/vm/nr_hugepaegs affects hugepages on a system level and now that 
it's changed, I think it should be done explicitly with a new flag than 
implicitly.

Would you explain why introducing a new mempolicy flag, MPOL_F_HUGEPAGES, 
and only using the new behavior when this is set would be inconsistent or 
inadvisable?  Since this is a new behavior that will differ from the 
long-standing default, it seems like it warrants a new mempolicy flag to 
avoid all userspace breakage and make hugepage allocation and freeing with 
an underlying mempolicy explicit.

This would address your audience who have been (privately) emailing you, 
confused about why hugepages allocated from a global 
tunable wouldn't be confined to their mempolicy.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-09 20:44                       ` David Rientjes
@ 2009-09-10 12:26                           ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-10 12:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Wed, Sep 09, 2009 at 01:44:28PM -0700, David Rientjes wrote:
> On Wed, 9 Sep 2009, Mel Gorman wrote:
> 
> > And to beat a dead horse, it does make sense that an application
> > allocating hugepages obey memory policies. It does with dynamic hugepage
> > resizing for example. It should have been done years ago and
> > unfortunately wasn't but it's not the first time that the behaviour of
> > hugepages differed from the core VM.
> > 
> 
> I agree completely, I'm certainly not defending the current implementation 
> as a sound design, and I too would have preferred that it had done the 
> same as Lee's patchset from the very beginning.  The issue I'm raising is 
> that while we both agree the current behavior is suboptimal and confusing, 
> it is the long-standing kernel behavior.  There are applications out there 
> that are written to allocate and free hugepages and now changing the pool 
> from which they can allocate or free to could be problematic.
> 

While I doubt there are many, the counter-example of one is not
something I can wish or wave away.

> I'm personally fine with the breakage since I'm aware of this discussion 
> and can easily fix it in userspace.  I'm more concerned about others 
> leaking hugepages or having their boot scripts break because they are 
> allocating far fewer hugepages than before. 

I just find it very improbable that system-wide maintenance or boot-up
processes run within restricted environments and then expect to
have system-wide capabilities, but it wouldn't be the first time something
unusual was implemented.

> The documentation 
> (Documentation/vm/hugetlbpage.txt) has always said 
> /proc/sys/vm/nr_hugepages affects hugepages on a system level, and now that 
> it's changed, I think it should be done explicitly with a new flag rather 
> than implicitly.
> 
> Would you explain why introducing a new mempolicy flag, MPOL_F_HUGEPAGES, 
> and only using the new behavior when this is set would be inconsistent or 
> inadvisable?

I already explained this. The interface in numactl would look weird. There
would be an --interleave switch and a --hugepages-interleave that only
applies to nr_hugepages. The smarts could be in hugeadm to apply the mask
when --pool-pages-min is specified but that wouldn't help scripts that are
still using echo.

> Since this is a new behavior that will differ from the 
> long-standing default, it seems like it warrants a new mempolicy flag to 
> avoid all userspace breakage and make hugepage allocation and freeing with 
> an underlying mempolicy explicit.
> 
> This would address your audience who have been (privately) emailing you, 
> confused about why hugepages allocated from a global 
> tunable wouldn't be confined to their mempolicy.
> 

I hate to have to do this, but how about nr_hugepages which acts
system-wide as it did traditionally and nr_hugepages_mempolicy that obeys
policies? Something like the following untested patch. It would be fairly
trivial for me to implement a --obey-mempolicies switch for hugeadm which
works in conjunction with --pool-pages-min and would be less likely to cause confusion
than --hugepages-interleave in numactl.
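
Roughly (--obey-mempolicies is the switch being proposed here, not an
existing hugeadm option):

	numactl -m 0,1 hugeadm --obey-mempolicies --pool-pages-min 2M:64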

Sorry, the patch is untested. I can't get hold of a NUMA machine at the moment
and fake NUMA support sucks far worse than I expected it to.

==== BEGIN PATCH ====

[PATCH] Optionally use a memory policy when tuning the size of the static hugepage pool

Patch "derive huge pages nodes allowed from task mempolicy" brought
huge page support more in line with the core VM in that tuning the size
of the static huge page pool would obey memory policies. Using this,
administrators could interleave allocation of huge pages from a subset
of nodes. This is consistent with how dynamic hugepage pool resizing
works and how hugepages get allocated to applications at run-time.

However, it was pointed out that scripts may exist that depend on being
able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
that are running within a memory policy. This patch adds
/proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
memory policies. /proc/sys/vm/nr_hugepages continues then to be a
system-wide tunable regardless of memory policy.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 include/linux/hugetlb.h |    1 +
 kernel/sysctl.c         |   11 +++++++++++
 mm/hugetlb.c            |   35 ++++++++++++++++++++++++++++++++---
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fcb1677..fc3a659 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -21,6 +21,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
+int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8bac3f5..0637655 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1171,6 +1171,17 @@ static struct ctl_table vm_table[] = {
 		.extra1		= (void *)&hugetlb_zero,
 		.extra2		= (void *)&hugetlb_infinity,
 	 },
+#ifdef CONFIG_NUMA
+	 {
+		.procname	= "nr_hugepages_mempolicy",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
+		.extra1		= (void *)&hugetlb_zero,
+		.extra2		= (void *)&hugetlb_infinity,
+	 },
+#endif
 	 {
 		.ctl_name	= VM_HUGETLB_GROUP,
 		.procname	= "hugetlb_shm_group",
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83decd6..68abef0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
+#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 								int nid)
@@ -1254,9 +1255,14 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	if (nid == NUMA_NO_NODE) {
+	switch (nid) {
+	case NUMA_NO_NODE_OBEY_MEMPOLICY:
 		nodes_allowed = alloc_nodemask_of_mempolicy();
-	} else {
+		break;
+	case NUMA_NO_NODE:
+		nodes_allowed = NULL;
+		break;
+	default:
 		/*
 		 * incoming 'count' is for node 'nid' only, so
 		 * adjust count to global, but restrict alloc/free
@@ -1265,7 +1271,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
 		nodes_allowed = alloc_nodemask_of_node(nid);
 	}
-	if (!nodes_allowed) {
+	if (!nodes_allowed && nid != NUMA_NO_NODE) {
 		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
 			"for huge page allocation.  Falling back to default.\n",
 			current->comm);
@@ -1796,6 +1802,29 @@ int hugetlb_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
+			   void __user *buffer,
+			   size_t *length, loff_t *ppos)
+{
+	struct hstate *h = &default_hstate;
+	unsigned long tmp;
+
+	if (!write)
+		tmp = h->max_huge_pages;
+
+	table->data = &tmp;
+	table->maxlen = sizeof(unsigned long);
+	proc_doulongvec_minmax(table, write, buffer, length, ppos);
+
+	if (write)
+		h->max_huge_pages = set_max_huge_pages(h, tmp,
+					NUMA_NO_NODE_OBEY_MEMPOLICY);
+
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
 			void __user *buffer,
 			size_t *length, loff_t *ppos)
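
As a usage sketch of the proposed tunable (assuming the patch above is
applied; node numbers and counts illustrative):

	# obeys the writer's mempolicy: pages interleaved over nodes 0 and 1
	numactl -m 0,1 sh -c 'echo 64 > /proc/sys/vm/nr_hugepages_mempolicy'

	# traditional system-wide behaviour, mempolicy ignored
	numactl -m 0,1 sh -c 'echo 64 > /proc/sys/vm/nr_hugepages'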


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-10 12:26                           ` Mel Gorman
@ 2009-09-11 22:27                             ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-11 22:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Thu, 10 Sep 2009, Mel Gorman wrote:

> > Would you explain why introducing a new mempolicy flag, MPOL_F_HUGEPAGES, 
> > and only using the new behavior when this is set would be inconsistent or 
> > inadvisable?
> 
> I already explained this. The interface in numactl would look weird. There
> would be an --interleave switch and a --hugepages-interleave that only
> applies to nr_hugepages. The smarts could be in hugeadm to apply the mask
> when --pool-pages-min is specified but that wouldn't help scripts that are
> still using echo.
> 

I don't think we need to address the scripts that are currently using echo 
since they're (hopefully) written to the kernel implementation, i.e. no 
mempolicy restriction on writing to nr_hugepages.

> I hate to have to do this, but how about nr_hugepages which acts
> system-wide as it did traditionally and nr_hugepages_mempolicy that obeys
> policies? Something like the following untested patch. It would be fairly
> trivial for me to implement a --obey-mempolicies switch for hugeadm which
> works in conjunction with --pool-pages-min and would be less likely to cause confusion
> than --hugepages-interleave in numactl.
> 

I like it.

> Sorry, the patch is untested. I can't get hold of a NUMA machine at the moment
> and fake NUMA support sucks far worse than I expected it to.
> 

Hmm, I rewrote most of fake NUMA a couple years ago.  What problems are 
you having with it?

> ==== BEGIN PATCH ====
> 
> [PATCH] Optionally use a memory policy when tuning the size of the static hugepage pool
> 
> Patch "derive huge pages nodes allowed from task mempolicy" brought
> huge page support more in line with the core VM in that tuning the size
> of the static huge page pool would obey memory policies. Using this,
> administrators could interleave allocation of huge pages from a subset
> of nodes. This is consistent with how dynamic hugepage pool resizing
> works and how hugepages get allocated to applications at run-time.
> 
> However, it was pointed out that scripts may exist that depend on being
> able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> that are running within a memory policy. This patch adds
> /proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
> memory policies. /proc/sys/vm/nr_hugepages continues then to be a
> system-wide tunable regardless of memory policy.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> --- 
>  include/linux/hugetlb.h |    1 +
>  kernel/sysctl.c         |   11 +++++++++++
>  mm/hugetlb.c            |   35 ++++++++++++++++++++++++++++++++---
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 

It'll need an update to Documentation/vm/hugetlbpage.txt, but this can 
probably be done in one of Lee's patches that edits the same file when he 
reposts.

> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index fcb1677..fc3a659 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -21,6 +21,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
>  
>  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
>  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8bac3f5..0637655 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1171,6 +1171,17 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= (void *)&hugetlb_zero,
>  		.extra2		= (void *)&hugetlb_infinity,
>  	 },
> +#ifdef CONFIG_NUMA
> +	 {
> +		.procname	= "nr_hugepages_mempolicy",
> +		.data		= NULL,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
> +		.extra1		= (void *)&hugetlb_zero,
> +		.extra2		= (void *)&hugetlb_infinity,
> +	 },
> +#endif
>  	 {
>  		.ctl_name	= VM_HUGETLB_GROUP,
>  		.procname	= "hugetlb_shm_group",
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83decd6..68abef0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  	return ret;
>  }
>  
> +#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
>  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>  								int nid)

I think it would be possible to avoid adding NUMA_NO_NODE_OBEY_MEMPOLICY 
if the nodemask were allocated in the sysctl handler and passed 
into set_max_huge_pages() instead of a nid.  Lee, what do you think?


Other than that, I like this approach because it avoids the potential for 
userspace breakage while adding the new feature in a way that avoids 
confusion.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-11 22:27                             ` David Rientjes
  (?)
@ 2009-09-14 13:33                             ` Mel Gorman
  2009-09-14 14:15                                 ` Lee Schermerhorn
  2009-09-14 19:14                                 ` David Rientjes
  -1 siblings, 2 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-14 13:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Fri, Sep 11, 2009 at 03:27:30PM -0700, David Rientjes wrote:
> On Thu, 10 Sep 2009, Mel Gorman wrote:
> 
> > > Would you explain why introducing a new mempolicy flag, MPOL_F_HUGEPAGES, 
> > > and only using the new behavior when this is set would be inconsistent or 
> > > inadvisable?
> > 
> > I already explained this. The interface in numactl would look weird. There
> > would be an --interleave switch and a --hugepages-interleave that only
> > applies to nr_hugepages. The smarts could be in hugeadm to apply the mask
> > when --pool-pages-min is specified but that wouldn't help scripts that are
> > still using echo.
> > 
> 
> I don't think we need to address the scripts that are currently using echo 
> since they're (hopefully) written to the kernel implementation, i.e. no 
> mempolicy restriction on writing to nr_hugepages.
> 

Ok.

> > I hate to have to do this, but how about nr_hugepages which acts
> > system-wide as it did traditionally and nr_hugepages_mempolicy that obeys
> > policies? Something like the following untested patch. It would be fairly
> > trivial for me to implement a --obey-mempolicies switch for hugeadm which
> > works in conjunction with --pool-pages-min and would be less likely to cause confusion
> > than --hugepages-interleave in numactl.
> > 
> 
> I like it.
> 

Ok, when I get this tested, I'll send it as a follow-on patch to Lee's
for proper incorporation.

> > Sorry, the patch is untested. I can't get hold of a NUMA machine at the moment
> > and fake NUMA support sucks far worse than I expected it to.
> > 
> 
> Hmm, I rewrote most of fake NUMA a couple years ago.  What problems are 
> you having with it?
> 

On PPC64, the parameters behave differently and I couldn't convince it to
create more than one NUMA node. On x86-64, the NUMA nodes appeared to
exist and were visible in /proc/buddyinfo, for example, but the sysfs
directories for the fake nodes were not created, so nr_hugepages couldn't
be examined on a per-node basis.
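
For reference, the x86-64 attempt was along these lines (the numa=fake boot
parameter; the node count is illustrative):

	# kernel command line: split memory into 8 fake NUMA nodes
	numa=fake=8

	# the fake nodes showed up in /proc/buddyinfo, but the nodeN
	# directories were missing here:
	ls /sys/devices/system/node/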

> > ==== BEGIN PATCH ====
> > 
> > [PATCH] Optionally use a memory policy when tuning the size of the static hugepage pool
> > 
> > Patch "derive huge pages nodes allowed from task mempolicy" brought
> > huge page support more in line with the core VM in that tuning the size
> > of the static huge page pool would obey memory policies. Using this,
> > administrators could interleave allocation of huge pages from a subset
> > of nodes. This is consistent with how dynamic hugepage pool resizing
> > works and how hugepages get allocated to applications at run-time.
> > 
> > However, it was pointed out that scripts may exist that depend on being
> > able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> > that are running within a memory policy. This patch adds
> > /proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
> > memory policies. /proc/sys/vm/nr_hugepages continues then to be a
> > system-wide tunable regardless of memory policy.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > --- 
> >  include/linux/hugetlb.h |    1 +
> >  kernel/sysctl.c         |   11 +++++++++++
> >  mm/hugetlb.c            |   35 ++++++++++++++++++++++++++++++++---
> >  3 files changed, 44 insertions(+), 3 deletions(-)
> > 
> 
> It'll need an update to Documentation/vm/hugetlbpage.txt, but this can 
> probably be done in one of Lee's patches that edits the same file when he 
> reposts.
> 

Agreed.

> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index fcb1677..fc3a659 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -21,6 +21,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> >  
> >  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> >  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> >  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> >  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> >  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8bac3f5..0637655 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1171,6 +1171,17 @@ static struct ctl_table vm_table[] = {
> >  		.extra1		= (void *)&hugetlb_zero,
> >  		.extra2		= (void *)&hugetlb_infinity,
> >  	 },
> > +#ifdef CONFIG_NUMA
> > +	 {
> > +		.procname	= "nr_hugepages_mempolicy",
> > +		.data		= NULL,
> > +		.maxlen		= sizeof(unsigned long),
> > +		.mode		= 0644,
> > +		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
> > +		.extra1		= (void *)&hugetlb_zero,
> > +		.extra2		= (void *)&hugetlb_infinity,
> > +	 },
> > +#endif
> >  	 {
> >  		.ctl_name	= VM_HUGETLB_GROUP,
> >  		.procname	= "hugetlb_shm_group",
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83decd6..68abef0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> >  	return ret;
> >  }
> >  
> > +#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)
> >  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> >  								int nid)
> 
> I think it would be possible to avoid adding NUMA_NO_NODE_OBEY_MEMPOLICY 
> if the nodemask were allocated in the sysctl handler and passed 
> into set_max_huge_pages() instead of a nid.  Lee, what do you think?
> 
> Other than that, I like this approach because it avoids the potential for 
> userspace breakage while adding the new feature in a way that avoids 
> confusion.
> 

Indeed. While the addition of another proc tunable sucks, it seems like
the only available compromise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 13:33                             ` Mel Gorman
@ 2009-09-14 14:15                                 ` Lee Schermerhorn
  2009-09-14 19:14                                 ` David Rientjes
  1 sibling, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-14 14:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, 2009-09-14 at 14:33 +0100, Mel Gorman wrote:
> On Fri, Sep 11, 2009 at 03:27:30PM -0700, David Rientjes wrote:
> > On Thu, 10 Sep 2009, Mel Gorman wrote:
> > 
> > > > Would you explain why introducing a new mempolicy flag, MPOL_F_HUGEPAGES, 
> > > > and only using the new behavior when this is set would be inconsistent or 
> > > > inadvisable?
> > > 
> > > I already explained this. The interface in numactl would look weird. There
> > > would be an --interleave switch and a --hugepages-interleave that only
> > > applies to nr_hugepages. The smarts could be in hugeadm to apply the mask
> > > when --pool-pages-min is specified but that wouldn't help scripts that are
> > > still using echo.
> > > 
> > 
> > I don't think we need to address the scripts that are currently using echo 
> > since they're (hopefully) written to the kernel implementation, i.e. no 
> > mempolicy restriction on writing to nr_hugepages.
> > 
> 
> Ok.
> 
> > > I hate to have to do this, but how about nr_hugepages which acts
> > > system-wide as it did traditionally and nr_hugepages_mempolicy that obeys
> > > policies? Something like the following untested patch. It would be fairly
> > > trivial for me to implement a --obey-mempolicies switch for hugeadm which
> > > works in conjunction with --pool-pages-min and would be less likely to cause confusion
> > > than --hugepages-interleave in numactl.
> > > 
> > 
> > I like it.
> > 
> 
> Ok, when I get this tested, I'll send it as a follow-on patch to Lee's
> for proper incorporation.
> 
> > > Sorry, the patch is untested. I can't get hold of a NUMA machine at the moment
> > > and fake NUMA support sucks far worse than I expected it to.
> > > 
> > 
> > Hmm, I rewrote most of fake NUMA a couple years ago.  What problems are 
> > you having with it?
> > 
> 
> > On PPC64, the parameters behave differently and I couldn't convince it to
> > create more than one NUMA node. On x86-64, the NUMA nodes appeared to
> > exist and were visible in /proc/buddyinfo, for example, but the sysfs
> > directories for the fake nodes were not created, so nr_hugepages couldn't
> > be examined on a per-node basis.
> 
> > > ==== BEGIN PATCH ====
> > > 
> > > [PATCH] Optionally use a memory policy when tuning the size of the static hugepage pool
> > > 
> > > Patch "derive huge pages nodes allowed from task mempolicy" brought
> > > huge page support more in line with the core VM in that tuning the size
> > > of the static huge page pool would obey memory policies. Using this,
> > > administrators could interleave allocation of huge pages from a subset
> > > of nodes. This is consistent with how dynamic hugepage pool resizing
> > > works and how hugepages get allocated to applications at run-time.
> > > 
> > > However, it was pointed out that scripts may exist that depend on being
> > > able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> > > that are running within a memory policy. This patch adds
> > > /proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
> > > memory policies. /proc/sys/vm/nr_hugepages continues then to be a
> > > system-wide tunable regardless of memory policy.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > --- 
> > >  include/linux/hugetlb.h |    1 +
> > >  kernel/sysctl.c         |   11 +++++++++++
> > >  mm/hugetlb.c            |   35 ++++++++++++++++++++++++++++++++---
> > >  3 files changed, 44 insertions(+), 3 deletions(-)
> > > 
> > 
> > It'll need an update to Documentation/vm/hugetlbpage.txt, but this can 
> > probably be done in one of Lee's patches that edits the same file when he 
> > reposts.
> > 
> 
> Agreed.

So, I'm respinning V7 today.  Shall I add this in as a separate patch?

Also, see below:


> 
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index fcb1677..fc3a659 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -21,6 +21,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > >  
> > >  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> > >  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > > +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > >  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > >  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > >  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 8bac3f5..0637655 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -1171,6 +1171,17 @@ static struct ctl_table vm_table[] = {
> > >  		.extra1		= (void *)&hugetlb_zero,
> > >  		.extra2		= (void *)&hugetlb_infinity,
> > >  	 },
> > > +#ifdef CONFIG_NUMA
> > > +	 {
> > > +		.procname	= "nr_hugepages_mempolicy",
> > > +		.data		= NULL,
> > > +		.maxlen		= sizeof(unsigned long),
> > > +		.mode		= 0644,
> > > +		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
> > > +		.extra1		= (void *)&hugetlb_zero,
> > > +		.extra2		= (void *)&hugetlb_infinity,
> > > +	 },
> > > +#endif
> > >  	 {
> > >  		.ctl_name	= VM_HUGETLB_GROUP,
> > >  		.procname	= "hugetlb_shm_group",
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83decd6..68abef0 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> > >  	return ret;
> > >  }
> > >  
> > > +#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)

How about defining NUMA_NO_NODE_OBEY_MEMPOLICY as (NUMA_NO_NODE - 1)
just to ensure that it's different?  Not sure it's worth an enum at this
point.  NUMA_NO_NODE_OBEY_MEMPOLICY is private to hugetlb at this time.

> > >  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > >  								int nid)
> > 
> > I think it would be possible to avoid adding NUMA_NO_NODE_OBEY_MEMPOLICY 
> > if the nodemask were allocated in the sysctl handler and passed 
> > into set_max_huge_pages() instead of a nid.  Lee, what do you think?
> > 
> > Other than that, I like this approach because it avoids the potential for 
> > userspace breakage while adding the new feature in a way that avoids 
> > confusion.
> > 
> 
> Indeed. While the addition of another proc tunable sucks, it seems like
> the only available compromise.

And, I suppose we need to replicate this under the global sysfs
hstates?
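
That is, alongside the existing global hstate attributes (the _mempolicy
path below is the presumed analogue, not something that exists yet):

	# existing global hstate attribute
	/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
	# presumed mempolicy-obeying analogue
	/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy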

Lee
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 14:15                                 ` Lee Schermerhorn
@ 2009-09-14 15:41                                   ` Mel Gorman
  -1 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-14 15:41 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: David Rientjes, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, Sep 14, 2009 at 10:15:48AM -0400, Lee Schermerhorn wrote:
> On Mon, 2009-09-14 at 14:33 +0100, Mel Gorman wrote:
> > On Fri, Sep 11, 2009 at 03:27:30PM -0700, David Rientjes wrote:
> > > On Thu, 10 Sep 2009, Mel Gorman wrote:
> > > 
> > > > > Would you explain why introducing a new mempolicy flag, MPOL_F_HUGEPAGES, 
> > > > > and only using the new behavior when this is set would be inconsistent or 
> > > > > inadvisable?
> > > > 
> > > > I already explained this. The interface in numactl would look weird. There
> > > > would be an --interleave switch and a --hugepages-interleave that only
> > > > applies to nr_hugepages. The smarts could be in hugeadm to apply the mask
> > > > when --pool-pages-min is specified but that wouldn't help scripts that are
> > > > still using echo.
> > > > 
> > > 
> > > I don't think we need to address the scripts that are currently using echo 
> > > since they're (hopefully) written to the kernel implementation, i.e. no 
> > > mempolicy restriction on writing to nr_hugepages.
> > > 
> > 
> > Ok.
> > 
> > > > I hate to have to do this, but how about nr_hugepages which acts
> > > > system-wide as it did traditionally and nr_hugepages_mempolicy that obeys
> > > > policies? Something like the following untested patch. It would be fairly
> > > > trivial for me to implement a --obey-mempolicies switch for hugeadm which
> > > > works in conjunction with --pool-pages-min and would be less likely to cause confusion
> > > > than --hugepages-interleave in numactl.
> > > > 
> > > 
> > > I like it.
> > > 
> > 
> > Ok, when I get this tested, I'll send it as a follow-on patch to Lee's
> > for proper incorporation.
> > 
> > > > Sorry, the patch is untested. I can't get hold of a NUMA machine at the moment
> > > > and fake NUMA support sucks far worse than I expected it to.
> > > > 
> > > 
> > > Hmm, I rewrote most of fake NUMA a couple years ago.  What problems are 
> > > you having with it?
> > > 
> > 
> > On PPC64, the parameters behave differently and I couldn't convince it to
> > create more than one NUMA node. On x86-64, the NUMA nodes appeared to
> > exist and were visible in /proc/buddyinfo, for example, but the sysfs
> > directories for the fake nodes were not created, so nr_hugepages couldn't
> > be examined on a per-node basis.
> > 
> > > > ==== BEGIN PATCH ====
> > > > 
> > > > [PATCH] Optionally use a memory policy when tuning the size of the static hugepage pool
> > > > 
> > > > Patch "derive huge pages nodes allowed from task mempolicy" brought
> > > > huge page support more in line with the core VM in that tuning the size
> > > > of the static huge page pool would obey memory policies. Using this,
> > > > administrators could interleave allocation of huge pages from a subset
> > > > of nodes. This is consistent with how dynamic hugepage pool resizing
> > > > works and how hugepages get allocated to applications at run-time.
> > > > 
> > > > However, it was pointed out that scripts may exist that depend on being
> > > > able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> > > > that are running within a memory policy. This patch adds
> > > > /proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
> > > > memory policies. /proc/sys/vm/nr_hugepages continues then to be a
> > > > system-wide tunable regardless of memory policy.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > --- 
> > > >  include/linux/hugetlb.h |    1 +
> > > >  kernel/sysctl.c         |   11 +++++++++++
> > > >  mm/hugetlb.c            |   35 ++++++++++++++++++++++++++++++++---
> > > >  3 files changed, 44 insertions(+), 3 deletions(-)
> > > > 
> > > 
> > > It'll need an update to Documentation/vm/hugetlb.txt, but this can 
> > > probably be done in one of Lee's patches that edits the same file when he 
> > > reposts.
> > > 
> > 
> > Agreed.
> 
> So, I'm respinning V7 today.  Shall I add this in as a separate patch?
> 

If you can, it'd be great but I can also spin a patch on top of yours
and update the documentation accordingly if you prefer.

> Also, see below:
> 
> 
> > 
> > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > > index fcb1677..fc3a659 100644
> > > > --- a/include/linux/hugetlb.h
> > > > +++ b/include/linux/hugetlb.h
> > > > @@ -21,6 +21,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > > >  
> > > >  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> > > >  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > > > +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > > >  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > > >  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> > > >  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > > index 8bac3f5..0637655 100644
> > > > --- a/kernel/sysctl.c
> > > > +++ b/kernel/sysctl.c
> > > > @@ -1171,6 +1171,17 @@ static struct ctl_table vm_table[] = {
> > > >  		.extra1		= (void *)&hugetlb_zero,
> > > >  		.extra2		= (void *)&hugetlb_infinity,
> > > >  	 },
> > > > +#ifdef CONFIG_NUMA
> > > > +	 {
> > > > +		.procname	= "nr_hugepages_mempolicy",
> > > > +		.data		= NULL,
> > > > +		.maxlen		= sizeof(unsigned long),
> > > > +		.mode		= 0644,
> > > > +		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
> > > > +		.extra1		= (void *)&hugetlb_zero,
> > > > +		.extra2		= (void *)&hugetlb_infinity,
> > > > +	 },
> > > > +#endif
> > > >  	 {
> > > >  		.ctl_name	= VM_HUGETLB_GROUP,
> > > >  		.procname	= "hugetlb_shm_group",
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index 83decd6..68abef0 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)
> 
> How about defining NUMA_NO_NODE_OBEY_MEMPOLICY as (NUMA_NO_NODE - 1)
> just to ensure that it's different.  Not sure it's worth an enum at this
> point.  NUMA_NO_NODE_OBEY_MEMPOLICY is private to hugetlb at this time.
> 

That seems reasonable.
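
Spelled out, that would be something like the following (sketch only,
not the posted patch; since NUMA_NO_NODE is -1, the derived value is
the same -2 as above):

	/* mm/hugetlb.c -- private sentinel, derived rather than hard-coded */
	#define NUMA_NO_NODE_OBEY_MEMPOLICY	(NUMA_NO_NODE - 1)

	/*
	 * nid == NUMA_NO_NODE:                 act on all on-line nodes
	 * nid == NUMA_NO_NODE_OBEY_MEMPOLICY:  derive nodes_allowed from the
	 *                                      calling task's mempolicy
	 * nid >= 0:                            act on the single specified node
	 */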

> > > >  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > > >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > > >  								int nid)
> > > 
> > > I think it would be possible to avoid adding NUMA_NO_NODE_OBEY_MEMPOLICY 
> > > if the nodemask was allocated in the sysctl handler instead and passed
> > > into set_max_huge_pages() instead of a nid.  Lee, what do you think?
> > > 
> > > Other than that, I like this approach because it avoids the potential for 
> > > userspace breakage while adding the new feature in a way that avoids 
> > > confusion.
> > > 
> > 
> > Indeed. While the addition of another proc tunable sucks, it seems like
> > the only available compromise.
> 
> And, I suppose we need to replicate this under the global sysfs
> hstates?
> 

You're right, they would also be needed under /sys/kernel/mm/hugepages/*
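
Something along these lines, perhaps (untested sketch, assuming the
existing HSTATE_ATTR() and kobj_to_hstate() helpers in mm/hugetlb.c;
the matching _show would just report the pool count the same way
nr_hugepages does):

	static ssize_t nr_hugepages_mempolicy_store(struct kobject *kobj,
			struct kobj_attribute *attr, const char *buf, size_t len)
	{
		struct hstate *h = kobj_to_hstate(kobj);
		unsigned long count;

		if (strict_strtoul(buf, 10, &count))
			return -EINVAL;

		/* like nr_hugepages_store(), but obey the task mempolicy */
		h->max_huge_pages = set_max_huge_pages(h, count,
						NUMA_NO_NODE_OBEY_MEMPOLICY);
		return len;
	}
	HSTATE_ATTR(nr_hugepages_mempolicy);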

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 13:33                             ` Mel Gorman
@ 2009-09-14 19:14                                 ` David Rientjes
  2009-09-14 19:14                                 ` David Rientjes
  1 sibling, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-14 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, 14 Sep 2009, Mel Gorman wrote:

> > Hmm, I rewrote most of fake NUMA a couple years ago.  What problems are 
> > you having with it?
> > 
> 
> On PPC64, the parameters behave differently. I couldn't convince it to
> create more than one NUMA node. On x86-64, the NUMA nodes appeared to
> exist and would be visible on /proc/buddyinfo for example but the sysfs
> directories for the fake nodes were not created so nr_hugepages couldn't
> be examined on a per-node basis for example.
> 

I don't know anything about the ppc64 fake NUMA, but the sysfs node 
directories should certainly be created on x86_64.  I'll look into it 
because that's certainly a bug.  Thanks.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 15:41                                   ` Mel Gorman
@ 2009-09-14 19:15                                     ` David Rientjes
  -1 siblings, 0 replies; 81+ messages in thread
From: David Rientjes @ 2009-09-14 19:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, 14 Sep 2009, Mel Gorman wrote:

> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 83decd6..68abef0 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> > > > >  	return ret;
> > > > >  }
> > > > >  
> > > > > +#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)
> > 
> > How about defining NUMA_NO_NODE_OBEY_MEMPOLICY as (NUMA_NO_NODE - 1)
> > just to ensure that it's different.  Not sure it's worth an enum at this
> > point.  NUMA_NO_NODE_OBEY_MEMPOLICY is private to hugetlb at this time.
> > 
> 
> That seems reasonable.
> 

If the nodemask allocation is moved to the sysctl handler and nodemask_t 
is passed into set_max_huge_pages() instead of nid, you don't need 
NUMA_NO_NODE_OBEY_MEMPOLICY at all, though.
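
Roughly like this, I'd imagine (hypothetical, untested;
init_nodemask_of_mempolicy() is a stand-in name for whatever helper
ends up deriving the mask from the task mempolicy):

	/* NULL nodes_allowed would mean "all on-line nodes" */
	static unsigned long set_max_huge_pages(struct hstate *h,
			unsigned long count, nodemask_t *nodes_allowed);

	int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
			void __user *buffer, size_t *length, loff_t *ppos)
	{
		NODEMASK_ALLOC(nodemask_t, nodes_allowed);

		/* ... parse the requested count into 'tmp' just as
		 * hugetlb_sysctl_handler() does today ... */
		if (write && init_nodemask_of_mempolicy(nodes_allowed))
			default_hstate.max_huge_pages =
				set_max_huge_pages(&default_hstate, tmp,
						   nodes_allowed);
		NODEMASK_FREE(nodes_allowed);
		return 0;
	}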


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 19:14                                 ` David Rientjes
@ 2009-09-14 21:28                                 ` David Rientjes
  2009-09-16 10:21                                   ` Mel Gorman
  -1 siblings, 1 reply; 81+ messages in thread
From: David Rientjes @ 2009-09-14 21:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, 14 Sep 2009, David Rientjes wrote:

> > On PPC64, the parameters behave differently. I couldn't convince it to
> > create more than one NUMA node. On x86-64, the NUMA nodes appeared to
> > exist and would be visible on /proc/buddyinfo for example but the sysfs
> > directories for the fake nodes were not created so nr_hugepages couldn't
> > be examined on a per-node basis for example.
> > 
> 
> I don't know anything about the ppc64 fake NUMA, but the sysfs node 
> directories should certainly be created on x86_64.  I'll look into it 
> because that's certainly a bug.  Thanks.
> 

This works on my machine just fine.

For example, with numa=fake=8:

	$ ls /sys/devices/system/node
	has_cpu  has_normal_memory  node0  node1  node2  node3  node4  
node5  node6  node7  online  possible

	$ ls /sys/devices/system/node/node3
	cpu4  cpu5  cpu6  cpu7  cpulist  cpumap  distance  meminfo  
numastat  scan_unevictable_pages

I don't see how this could differ if bootmem is setting up the nodes 
correctly, which dmesg | grep "^Bootmem setup node" would reveal.

The defconfig disables CONFIG_NUMA_EMU now, though, so perhaps it got 
turned off by accident in your kernel?

Let me know if there are any abnormalities with your particular setup.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 19:15                                     ` David Rientjes
@ 2009-09-15 11:48                                       ` Mel Gorman
  -1 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-15 11:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, Sep 14, 2009 at 12:15:43PM -0700, David Rientjes wrote:
> On Mon, 14 Sep 2009, Mel Gorman wrote:
> 
> > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > > index 83decd6..68abef0 100644
> > > > > > --- a/mm/hugetlb.c
> > > > > > +++ b/mm/hugetlb.c
> > > > > > @@ -1244,6 +1244,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> > > > > >  	return ret;
> > > > > >  }
> > > > > >  
> > > > > > +#define NUMA_NO_NODE_OBEY_MEMPOLICY (-2)
> > > 
> > > How about defining NUMA_NO_NODE_OBEY_MEMPOLICY as (NUMA_NO_NODE - 1)
> > > just to ensure that it's different.  Not sure it's worth an enum at this
> > > point.  NUMA_NO_NODE_OBEY_MEMPOLICY is private to hugetlb at this time.
> > > 
> > 
> > That seems reasonable.
> > 
> 
> If the nodemask allocation is moved to the sysctl handler and nodemask_t 
> is passed into set_max_huge_pages() instead of nid, you don't need 
> NUMA_NO_NODE_OBEY_MEMPOLICY at all, though.
> 

Very likely. When V7 comes out, I'll spin a patch for that and see what
it looks like if Lee doesn't beat me to it.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-14 21:28                                 ` David Rientjes
@ 2009-09-16 10:21                                   ` Mel Gorman
  0 siblings, 0 replies; 81+ messages in thread
From: Mel Gorman @ 2009-09-16 10:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Nishanth Aravamudan,
	linux-numa, Adam Litke, Andy Whitcroft, Eric Whitney,
	Randy Dunlap

On Mon, Sep 14, 2009 at 02:28:27PM -0700, David Rientjes wrote:
> On Mon, 14 Sep 2009, David Rientjes wrote:
> 
> > > On PPC64, the parameters behave differently. I couldn't convince it to
> > > create more than one NUMA node. On x86-64, the NUMA nodes appeared to
> > > exist and would be visible on /proc/buddyinfo for example but the sysfs
> > > directories for the fake nodes were not created so nr_hugepages couldn't
> > > be examined on a per-node basis for example.
> > > 
> > 
> > I don't know anything about the ppc64 fake NUMA, but the sysfs node 
> > directories should certainly be created on x86_64.  I'll look into it 
> > because that's certainly a bug.  Thanks.
> > 
> 
> This works on my machine just fine.
> 
> For example, with numa=fake=8:
> 
> 	$ ls /sys/devices/system/node
> 	has_cpu  has_normal_memory  node0  node1  node2  node3  node4  
> node5  node6  node7  online  possible
> 
> 	$ ls /sys/devices/system/node/node3
> 	cpu4  cpu5  cpu6  cpu7  cpulist  cpumap  distance  meminfo  
> numastat  scan_unevictable_pages
> 
> I don't see how this could differ if bootmem is setting up the nodes 
> correctly, which dmesg | grep "^Bootmem setup node" would reveal.
> 
> The defconfig disables CONFIG_NUMA_EMU now, though, so perhaps it got 
> turned off by accident in your kernel?
> 

I don't think so because my recollection is that the nodes existed
according to meminfo and buddyinfo but not the sysfs files.
Unfortunately I can't remember the reproduction scenario. I thought it
was on mmotm-2009-08-27-16-51 on a particular machine, but when I went
to reproduce it, the machine didn't even boot, so somewhere along the
line I busted things.

> Let me know if there's any abnormalities with your particular setup.
> 

Will try reproducing on more recent mmotm and see if anything odd falls
out.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-09 16:31 [PATCH 0/6] hugetlb: V6 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-09 16:32   ` Lee Schermerhorn
  0 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-09 16:32 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.

Against:  2.6.31-rc7-mmotm-090827-1651

V2:  Add brief description of per node attributes.

V6:  address review comments

This patch updates the kernel huge tlb documentation to describe the
numa memory policy based huge page management.  Additionally, the patch
includes a fair amount of rework to improve consistency, eliminate
duplication and set the context for documenting the memory policy
interaction.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>

 Documentation/vm/hugetlbpage.txt |  263 +++++++++++++++++++++++++--------------
 1 file changed, 175 insertions(+), 88 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-1651/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-1651.orig/Documentation/vm/hugetlbpage.txt	2009-09-09 11:57:26.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-1651/Documentation/vm/hugetlbpage.txt	2009-09-09 11:57:37.000000000 -0400
@@ -11,23 +11,21 @@ This optimization is more critical now a
 (several GBs) are more readily available.
 
 Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSv shared memory system calls (shmget, shmat).
+system call or standard SYSV shared memory system calls (shmget, shmat).
 
 First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with huge page support should show the number of configured
-huge pages in the system by running the "cat /proc/meminfo" command.
+The /proc/meminfo file provides information about the total number of
+persistent hugetlb pages in the kernel's huge page pool.  It also displays
+information about the number of free, reserved and surplus huge pages and the
+default huge page size.  The huge page size is needed for generating the
+proper alignment and size of the arguments to system calls that map huge page
+regions.
 
-/proc/meminfo also provides information about the total number of hugetlb
-pages configured in the kernel.  It also displays information about the
-number of free hugetlb pages at any time.  It also displays information about
-the configured huge page size - this is needed for generating the proper
-alignment and size of the arguments to the above system calls.
-
-The output of "cat /proc/meminfo" will have lines like:
+The output of "cat /proc/meminfo" will include lines like:
 
 .....
 HugePages_Total: vvv
@@ -53,59 +51,63 @@ HugePages_Surp  is short for "surplus,"
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) huge pages.
-The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of huge pages is
-possible only if there are enough hugetlb pages free that can be transferred
-back to regular memory pool).
-
-Pages that are used as hugetlb pages are reserved inside the kernel and cannot
-be used for other purposes.
-
-Once the kernel with Hugetlb page support is built and running, a user can
-use either the mmap system call or shared memory system calls to start using
-the huge pages.  It is required that the system administrator preallocate
-enough memory for huge page purposes.
-
-The administrator can preallocate huge pages on the kernel boot command line by
-specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
-requested.  This is the most reliable method for preallocating huge pages as
-memory has not yet become fragmented.
+/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
+pages in the kernel's huge page pool.  "Persistent" huge pages will be
+returned to the huge page pool when freed by a task.  A user with root
+privileges can dynamically allocate more or free some persistent huge pages
+by increasing or decreasing the value of 'nr_hugepages'.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes.  Huge pages cannot be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages.  See the discussion of
+Using Huge Pages, below.
+
+The administrator can allocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested.  This is the most reliable method of
+allocating huge pages as memory has not yet become fragmented.
 
-Some platforms support multiple huge page sizes.  To preallocate huge pages
+Some platforms support multiple huge page sizes.  To allocate huge pages
 of a specific size, one must preceed the huge pages boot command parameters
 with a huge page size selection parameter "hugepagesz=<size>".  <size> must
 be specified in bytes with optional scale suffix [kKmMgG].  The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured [default
-size] hugetlb pages in the kernel.  Super user can dynamically request more
-(or free some pre-configured) huge pages.
-
-Use the following command to dynamically allocate/deallocate default sized
-huge pages:
+When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 default sized huge pages in the system.
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
 On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over the all on-line nodes.  These huge pages, allocated when nr_hugepages
-is increased, are called "persistent huge pages".
+over the set of allowed nodes specified by the NUMA memory policy of the
+task that modifies nr_hugepages.  The default for the allowed nodes--when the
+task has default memory policy--is all on-line nodes.  Allowed nodes with
+insufficient available, contiguous memory for a huge page will be silently
+skipped when allocating persistent huge pages.  See the discussion below of
+the interaction of task memory policy, cpusets and per node attributes with
+the allocation and freeing of persistent huge pages.
 
 The success or failure of huge page allocation depends on the amount of
-physically contiguous memory that is preset in system at the time of the
+physically contiguous memory that is present in the system at the time of the
 allocation attempt.  If the kernel is unable to allocate huge pages from
 some nodes in a NUMA system, it will attempt to make up the difference by
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any.
 
-System administrators may want to put this command in one of the local rc init
-files.  This will enable the kernel to request huge pages early in the boot
-process when the possibility of getting physical contiguous pages is still
-very high.  Administrators can verify the number of huge pages actually
-allocated by checking the sysctl or meminfo.  To check the per node
+System administrators may want to put this command in one of the local rc
+init files.  This will enable the kernel to allocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high.  Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo.  To check the per node
 distribution of huge pages in a NUMA system, use:
 
 	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
@@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
 /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
 huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
 requested by applications.  Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
-huge pages from the buddy allocator, when the normal pool is exhausted. As
-these surplus huge pages go out of use, they are freed back to the buddy
-allocator.
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
 
-When increasing the huge page pool size via nr_hugepages, any surplus
+When increasing the huge page pool size via nr_hugepages, any existing surplus
 pages will first be promoted to persistent huge pages.  Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
-the new huge page pool size.
+the new persistent huge page pool size.
 
-The administrator may shrink the pool of preallocated huge pages for
+The administrator may shrink the pool of persistent huge pages for
 the default huge page size by setting the nr_hugepages sysctl to a
 smaller value.  The kernel will attempt to balance the freeing of huge pages
-across all on-line nodes.  Any free huge pages on the selected nodes will
-be freed back to the buddy allocator.
-
-Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of huge pages in use will convert the balance to surplus
-huge pages even if it would exceed the overcommit value.  As long as
-this condition holds, however, no more surplus huge pages will be
-allowed on the system until one of the two sysctls are increased
-sufficiently, or the surplus huge pages go out of use and are freed.
+across all nodes in the memory policy of the task modifying nr_hugepages.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages.  This will occur even if
+the number of surplus pages would exceed the overcommit value.  As long as
+this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
 
 With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface has been duplicated in sysfs. The above
-information applies to the default huge page size which will be
-controlled by the /proc interfaces for backwards compatibility. The root
-huge page control directory in sysfs is:
+the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
+The /proc interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is:
 
 	/sys/kernel/mm/hugepages
 
 For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form
+will exist, of the form:
 
 	hugepages-${size}kB
 
@@ -159,6 +162,98 @@ Inside each of these directories, the sa
 
 which function as described above for the default huge page-sized case.
 
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
+
+Whether huge pages are allocated and freed via the /proc interface or
+the /sysfs interface, the NUMA nodes from which huge pages are allocated
+or freed are controlled by the NUMA memory policy of the task that modifies
+the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the nr_hugepages example above, is:
+
+    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+or, more succinctly:
+
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+This will allocate or free abs(20 - nr_hugepages) huge pages to or from the
+specified in <node-list>, depending on whether nr_hugepages is initially
+less than or greater than 20, respectively.  No huge pages will be
+allocated nor freed on any node not included in the specified <node-list>.
+
+Any memory policy mode--bind, preferred, local or interleave--may be
+used.  The effect on persistent huge page allocation is as follows:
+
+1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+   persistent huge pages will be distributed across the node or nodes
+   specified in the mempolicy as if "interleave" had been specified.
+   However, if a node in the policy does not contain sufficient contiguous
+   memory for a huge page, the allocation will not "fallback" to the nearest
+   neighbor node with sufficient contiguous memory.  To do this would cause
+   undesirable imbalance in the distribution of the huge page pool, or
+   possibly, allocation of persistent huge pages on nodes not allowed by
+   the task's memory policy.
+
+2) One or more nodes may be specified with the bind or interleave policy.
+   If more than one node is specified with the preferred policy, only the
+   lowest numeric id will be used.  Local policy will select the node where
+   the task is running at the time the nodes_allowed mask is constructed.
+
+3) For local policy to be deterministic, the task must be bound to a cpu or
+   cpus in a single node.  Otherwise, the task could be migrated to some
+   other node at any time after launch and the resulting node will be
+   indeterminate.  Thus, local policy is not very useful for this purpose.
+   Any of the other mempolicy modes may be used to specify a single node.
+
+4) The nodes allowed mask will be derived from any non-default task mempolicy,
+   whether this policy was set explicitly by the task itself or one of its
+   ancestors, such as numactl.  This means that if the task is invoked from a
+   shell with non-default policy, that policy will be used.  One can specify a
+   node list of "all" with numactl --interleave or --membind [-m] to achieve
+   interleaving over all nodes in the system or cpuset.
+
+5) Any task mempolicy specified--e.g., using numactl--will be constrained by
+   the resource limits of any cpuset in which the task runs.  Thus, there will
+   be no way for a task with non-default policy running in a cpuset with a
+   subset of the system nodes to allocate huge pages outside the cpuset
+   without first moving to a cpuset that contains all of the desired nodes.
+
+6) Boot-time huge page allocation attempts to distribute the requested number
+   of huge pages over all on-line nodes.
+
+Per Node Hugepages Attributes
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, has been replicated under each "node" system device in:
+
+	/sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files:
+
+	nr_hugepages
+	free_hugepages
+	surplus_hugepages
+
+The free_ and surplus_ attribute files are read-only.  They return the number
+of free and surplus [overcommitted] huge pages, respectively, on the parent
+node.
+
+The nr_hugepages attribute will return the total number of huge pages on the
+specified node.  When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
+
+Note that the number of overcommit and reserve pages remain global quantities,
+as we don't know until fault time, when the faulting task's mempolicy is applied,
+from which node the huge page allocation will be attempted.
+
+
+Using Huge Pages:
+
 If the user applications are going to request huge pages using mmap system
 call, then it is required that system administrator mount a file system of
 type hugetlbfs:
@@ -206,9 +301,11 @@ map_hugetlb.c.
  * requesting huge pages.
  *
  * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages.  That means the addresses starting with 0x800000... will need
- * to be specified.  Specifying a fixed address is not required on ppc64,
- * i386 or x86_64.
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
  *
  * Note: The default shared memory limit is quite low on many kernels,
  * you may need to increase it via:
@@ -237,14 +334,8 @@ map_hugetlb.c.
 
 #define dprintf(x)  printf(x)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define SHMAT_FLAGS (0)
-#endif
 
 int main(void)
 {
@@ -302,10 +393,12 @@ int main(void)
  * example, the app is requesting memory of size 256MB that is backed by
  * huge pages.
  *
- * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
- * That means the addresses starting with 0x800000... will need to be
- * specified.  Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
  */
 #include <stdlib.h>
 #include <stdio.h>
@@ -317,14 +410,8 @@ int main(void)
 #define LENGTH (256UL*1024*1024)
 #define PROTECTION (PROT_READ | PROT_WRITE)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define FLAGS (MAP_SHARED)
-#endif
 
 void check_bytes(char *addr)
 {
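
For completeness, a small user-space illustration of the per node
attributes described above (sketch only; assumes a 2MB default huge
page size, hence the hugepages-2048kB directory name):

	/* set_node_hugepages.c -- write a per node nr_hugepages attribute */
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		char path[128];
		FILE *f;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <node> <count>\n", argv[0]);
			return 1;
		}
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%s"
			 "/hugepages/hugepages-2048kB/nr_hugepages", argv[1]);
		f = fopen(path, "w");
		if (!f) {
			perror(path);
			return 1;
		}
		/* per the text above, this adjusts the pool on that node
		 * regardless of the writer's mempolicy or cpuset */
		fprintf(f, "%s\n", argv[2]);
		return fclose(f) ? 1 : 0;
	}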


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.
@ 2009-09-09 16:32   ` Lee Schermerhorn
  0 siblings, 0 replies; 81+ messages in thread
From: Lee Schermerhorn @ 2009-09-09 16:32 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 6/6] hugetlb:  update hugetlb documentation for mempolicy based management.

Against:  2.6.31-rc7-mmotm-090827-1651

V2:  Add brief description of per node attributes.

V6:  address review comments

This patch updates the kernel huge tlb documentation to describe the
numa memory policy based huge page management.  Additionaly, the patch
includes a fair amount of rework to improve consistency, eliminate
duplication and set the context for documenting the memory policy
interaction.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>

 Documentation/vm/hugetlbpage.txt |  263 +++++++++++++++++++++++++--------------
 1 file changed, 175 insertions(+), 88 deletions(-)

Index: linux-2.6.31-rc7-mmotm-090827-1651/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-rc7-mmotm-090827-1651.orig/Documentation/vm/hugetlbpage.txt	2009-09-09 11:57:26.000000000 -0400
+++ linux-2.6.31-rc7-mmotm-090827-1651/Documentation/vm/hugetlbpage.txt	2009-09-09 11:57:37.000000000 -0400
@@ -11,23 +11,21 @@ This optimization is more critical now a
 (several GBs) are more readily available.
 
 Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSv shared memory system calls (shmget, shmat).
+system call or standard SYSV shared memory system calls (shmget, shmat).
 
 First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with huge page support should show the number of configured
-huge pages in the system by running the "cat /proc/meminfo" command.
+The /proc/meminfo file provides information about the total number of
+persistent hugetlb pages in the kernel's huge page pool.  It also displays
+information about the number of free, reserved and surplus huge pages and the
+default huge page size.  The huge page size is needed for generating the
+proper alignment and size of the arguments to system calls that map huge page
+regions.
 
-/proc/meminfo also provides information about the total number of hugetlb
-pages configured in the kernel.  It also displays information about the
-number of free hugetlb pages at any time.  It also displays information about
-the configured huge page size - this is needed for generating the proper
-alignment and size of the arguments to the above system calls.
-
-The output of "cat /proc/meminfo" will have lines like:
+The output of "cat /proc/meminfo" will include lines like:
 
 .....
 HugePages_Total: vvv
@@ -53,59 +51,63 @@ HugePages_Surp  is short for "surplus,"
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) huge pages.
-The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of huge pages is
-possible only if there are enough hugetlb pages free that can be transferred
-back to regular memory pool).
-
-Pages that are used as hugetlb pages are reserved inside the kernel and cannot
-be used for other purposes.
-
-Once the kernel with Hugetlb page support is built and running, a user can
-use either the mmap system call or shared memory system calls to start using
-the huge pages.  It is required that the system administrator preallocate
-enough memory for huge page purposes.
-
-The administrator can preallocate huge pages on the kernel boot command line by
-specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
-requested.  This is the most reliable method for preallocating huge pages as
-memory has not yet become fragmented.
+/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
+pages in the kernel's huge page pool.  "Persistent" huge pages will be
+returned to the huge page pool when freed by a task.  A user with root
+privileges can dynamically allocate more or free some persistent huge pages
+by increasing or decreasing the value of 'nr_hugepages'.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes.  Huge pages cannot be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages.  See the discussion of
+Using Huge Pages, below.
+
+The administrator can allocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested.  This is the most reliable method of
+allocating huge pages as memory has not yet become fragmented.
 
-Some platforms support multiple huge page sizes.  To preallocate huge pages
+Some platforms support multiple huge page sizes.  To allocate huge pages
 of a specific size, one must preceed the huge pages boot command parameters
 with a huge page size selection parameter "hugepagesz=<size>".  <size> must
 be specified in bytes with optional scale suffix [kKmMgG].  The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured [default
-size] hugetlb pages in the kernel.  Super user can dynamically request more
-(or free some pre-configured) huge pages.
-
-Use the following command to dynamically allocate/deallocate default sized
-huge pages:
+When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 default sized huge pages in the system.
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
 On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over the all on-line nodes.  These huge pages, allocated when nr_hugepages
-is increased, are called "persistent huge pages".
+over all the set of allowed nodes specified by the NUMA memory policy of the
+task that modifies nr_hugepages.  The default for the allowed nodes--when the
+task has default memory policy--is all on-line nodes.  Allowed nodes with
+insufficient available, contiguous memory for a huge page will be silently
+skipped when allocating persistent huge pages.  See the discussion below of
+the interaction of task memory policy, cpusets and per node attributes with
+the allocation and freeing of persistent huge pages.
 
 The success or failure of huge page allocation depends on the amount of
-physically contiguous memory that is preset in system at the time of the
+physically contiguous memory that is present in system at the time of the
 allocation attempt.  If the kernel is unable to allocate huge pages from
 some nodes in a NUMA system, it will attempt to make up the difference by
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any.
 
-System administrators may want to put this command in one of the local rc init
-files.  This will enable the kernel to request huge pages early in the boot
-process when the possibility of getting physical contiguous pages is still
-very high.  Administrators can verify the number of huge pages actually
-allocated by checking the sysctl or meminfo.  To check the per node
+System administrators may want to put this command in one of the local rc
+init files.  This will enable the kernel to allocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high.  Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo.  To check the per node
 distribution of huge pages in a NUMA system, use:
 
 	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
@@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
 /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
 huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
 requested by applications.  Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
-huge pages from the buddy allocator, when the normal pool is exhausted. As
-these surplus huge pages go out of use, they are freed back to the buddy
-allocator.
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
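+
+For example, to allow up to 5 surplus huge pages of the default size:
+
+	echo 5 > /proc/sys/vm/nr_overcommit_hugepages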
 
-When increasing the huge page pool size via nr_hugepages, any surplus
+When increasing the huge page pool size via nr_hugepages, any existing surplus
 pages will first be promoted to persistent huge pages.  Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
-the new huge page pool size.
+the new persistent huge page pool size.
 
-The administrator may shrink the pool of preallocated huge pages for
+The administrator may shrink the pool of persistent huge pages for
 the default huge page size by setting the nr_hugepages sysctl to a
 smaller value.  The kernel will attempt to balance the freeing of huge pages
-across all on-line nodes.  Any free huge pages on the selected nodes will
-be freed back to the buddy allocator.
-
-Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of huge pages in use will convert the balance to surplus
-huge pages even if it would exceed the overcommit value.  As long as
-this condition holds, however, no more surplus huge pages will be
-allowed on the system until one of the two sysctls are increased
-sufficiently, or the surplus huge pages go out of use and are freed.
+across all nodes in the memory policy of the task modifying nr_hugepages.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages.  This will occur even if
+the number of surplus pages would exceed the overcommit value.  As long as
+this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
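+
+A worked example with illustrative numbers: given 20 persistent huge pages
+of which 18 are in use, writing 10 to nr_hugepages frees the 2 unused pages
+and converts 8 of the in-use pages to surplus--10 persistent plus 8
+surplus--even if 8 exceeds nr_overcommit_hugepages.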
 
 With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface has been duplicated in sysfs. The above
-information applies to the default huge page size which will be
-controlled by the /proc interfaces for backwards compatibility. The root
-huge page control directory in sysfs is:
+the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
+The /proc interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is:
 
 	/sys/kernel/mm/hugepages
 
 For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form
+will exist, of the form:
 
 	hugepages-${size}kB
 
@@ -159,6 +162,98 @@ Inside each of these directories, the sa
 
 which function as described above for the default huge page-sized case.
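+
+For example, assuming a 2048kB default huge page size (architecture
+dependent), the default pool could equally be resized with:
+
+	echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages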
 
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
+
+Whether huge pages are allocated and freed via the /proc interface or
+the /sysfs interface, the NUMA nodes from which huge pages are allocated
+or freed are controlled by the NUMA memory policy of the task that modifies
+the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the nr_hugepages example above, is:
+
+    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+or, more succinctly, with -m, the short form of --membind (for this purpose
+all policy modes behave alike; see below):
+
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+This will allocate or free abs(20 - nr_hugepages) huge pages to or from the
+nodes specified in <node-list>, depending on whether nr_hugepages is
+initially less than or greater than 20, respectively.  No huge pages will be
+allocated nor freed on any node not included in the specified <node-list>.
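+
+For example, on a hypothetical 4-node system with an initially empty pool:
+
+    numactl -m 0,1 echo 32 >/proc/sys/vm/nr_hugepages
+
+would allocate 32 huge pages distributed over nodes 0 and 1 only--16 on
+each, per item 1 below--and none on nodes 2 and 3.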
+
+Any memory policy mode--bind, preferred, local or interleave--may be
+used.  The effect on persistent huge page allocation is as follows:
+
+1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+   persistent huge pages will be distributed across the node or nodes
+   specified in the mempolicy as if "interleave" had been specified.
+   However, if a node in the policy does not contain sufficient contiguous
+   memory for a huge page, the allocation will not "fall back" to the nearest
+   neighbor node with sufficient contiguous memory.  To do this would cause
+   undesirable imbalance in the distribution of the huge page pool, or
+   possibly, allocation of persistent huge pages on nodes not allowed by
+   the task's memory policy.
+
+2) One or more nodes may be specified with the bind or interleave policy.
+   If more than one node is specified with the preferred policy, only the
+   lowest numeric id will be used.  Local policy will select the node where
+   the task is running at the time the nodes_allowed mask is constructed.
+
+3) For local policy to be deterministic, the task must be bound to a cpu or
+   cpus in a single node.  Otherwise, the task could be migrated to some
+   other node at any time after launch and the resulting node will be
+   indeterminate.  Thus, local policy is not very useful for this purpose.
+   Any of the other mempolicy modes may be used to specify a single node.
+
+4) The nodes allowed mask will be derived from any non-default task mempolicy,
+   whether this policy was set explicitly by the task itself or one of its
+   ancestors, such as numactl.  This means that if the task is invoked from a
+   shell with non-default policy, that policy will be used.  One can specify a
+   node list of "all" with numactl --interleave or --membind [-m] to achieve
+   interleaving over all nodes in the system or cpuset; see the example
+   following this list.
+
+5) Any task mempolicy specified--e.g., using numactl--will be constrained by
+   the resource limits of any cpuset in which the task runs.  Thus, there will
+   be no way for a task with non-default policy running in a cpuset with a
+   subset of the system nodes to allocate huge pages outside the cpuset
+   without first moving to a cpuset that contains all of the desired nodes.
+
+6) Boot-time huge page allocation attempts to distribute the requested number
+   of huge pages over all on-line nodes.
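+
+For example, to interleave a pool of 40 default sized huge pages over all
+nodes in the system or cpuset:
+
+    numactl --interleave=all echo 40 >/proc/sys/vm/nr_hugepages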
+
+Per Node Hugepages Attributes:
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, has been replicated under each "node" system device in:
+
+	/sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files:
+
+	nr_hugepages
+	free_hugepages
+	surplus_hugepages
+
+The free_hugepages and surplus_hugepages attribute files are read-only.  They
+return the number of free and surplus [overcommitted] huge pages,
+respectively, on the parent node.
+
+The nr_hugepages attribute will return the total number of huge pages on the
+specified node.  When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
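+
+For example, the following--node 2 and the 2048kB page size are
+illustrative--sets aside 8 huge pages on node 2 regardless of the writing
+task's mempolicy or cpuset:
+
+	echo 8 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages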
+
+Note that the counts of overcommit and reserve pages remain global
+quantities, as we don't know until fault time--when the faulting task's
+mempolicy is applied--from which node the huge page allocation will be
+attempted.
+
+
+Using Huge Pages:
+
 If user applications are going to request huge pages using the mmap system
 call, then the system administrator must mount a file system of
 type hugetlbfs:
@@ -206,9 +301,11 @@ map_hugetlb.c.
  * requesting huge pages.
  *
  * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages.  That means the addresses starting with 0x800000... will need
- * to be specified.  Specifying a fixed address is not required on ppc64,
- * i386 or x86_64.
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64, are not so constrained.
  *
  * Note: The default shared memory limit is quite low on many kernels,
  * you may need to increase it via:
@@ -237,14 +334,8 @@ map_hugetlb.c.
 
 #define dprintf(x)  printf(x)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define SHMAT_FLAGS (0)
-#endif
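+
+/*
+ * Illustrative note, not part of the original example: the macros above
+ * feed the attach call later in this program, roughly:
+ *
+ *	shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
+ *
+ * With ADDR == NULL, the kernel selects a suitably aligned address.
+ */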
 
 int main(void)
 {
@@ -302,10 +393,12 @@ int main(void)
  * example, the app is requesting memory of size 256MB that is backed by
  * huge pages.
  *
- * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
- * That means the addresses starting with 0x800000... will need to be
- * specified.  Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64, are not so constrained.
  */
 #include <stdlib.h>
 #include <stdio.h>
@@ -317,14 +410,8 @@ int main(void)
 #define LENGTH (256UL*1024*1024)
 #define PROTECTION (PROT_READ | PROT_WRITE)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define FLAGS (MAP_SHARED)
-#endif
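+
+/*
+ * Illustrative note, not part of the original example: main() maps the
+ * hugetlbfs-backed file with something like:
+ *
+ *	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
+ *
+ * where fd refers to a file created in a hugetlbfs mount.  With
+ * ADDR == NULL, the kernel selects a properly aligned address.
+ */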
 
 void check_bytes(char *addr)
 {
