* [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy
@ 2009-09-15 20:43 Lee Schermerhorn
  2009-09-15 20:43   ` Lee Schermerhorn
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:43 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

PATCH 0/11 hugetlb: numa control of persistent huge pages alloc/free

Against:  2.6.31-mmotm-090914-0157

This is V7 of a series of patches that provide control over where
persistent huge pages are allocated and freed on a NUMA platform.
Please consider [at least patches 1-8] for merging into mmotm.

This series uses two mechanisms to constrain the nodes from which
persistent huge pages are allocated and freed:  1) the NUMA mempolicy
of the task modifying a new sysctl, "nr_hugepages_mempolicy" [patch 8],
based on a suggestion by Mel Gorman; and 2) a subset of the hugepages
hstate sysfs attributes, added [in V4] to each node system device
under:

	/sys/devices/node/node[0-9]*/hugepages.

The per node attributes allow direct assignment of a huge page
count on a specific node, regardless of the task's mempolicy or
cpuset constraints.

V5 addressed review comments -- changes described in patch descriptions.

V6 addressed more review comments, described in the patches.

V6 also included a 3-patch series that implements an enhancement suggested
by David Rientjes:  the default huge page nodes allowed mask will be the
nodes with memory rather than all on-line nodes, and we will allocate per
node hstate attributes only for nodes with memory.  This requires that we
register a memory on/off-line notifier and [un]register the attributes on
transitions to/from the memoryless state.

V7 addresses review comments, described in the patches, and includes a
new patch, originally from Mel Gorman, that defines a new vm sysctl and
sysfs global hugepages attribute, "nr_hugepages_mempolicy", rather than
applying mempolicy constraints to pool adjustments via the pre-existing
"nr_hugepages".  The 3 patches that restrict hugetlb to visiting only
nodes with memory and add/remove per node hstate attributes on memory
hotplug complete V7.


* [PATCH 1/11] hugetlb:  rework hstate_next_node_* functions
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:43   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:43 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 1/11] hugetlb:  rework hstate_next_node* functions

Against:  2.6.31-mmotm-090914-0157

V2: + cleaned up comments, removed some deemed unnecessary,
      added some suggested by review
    + removed check for !current in huge_mpol_nodes_allowed().
    + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
    + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
      catch out of range node id.
    + added examples to patch description

V3: + factored this "cleanup" patch out of V2 patch 2/3
    + moved ahead of patch to add nodes_allowed mask to alloc funcs
      as this patch is somewhat independent from using task mempolicy
      to control huge page allocation and freeing.

Modify the hstate_next_node* functions so that they can be called to
obtain the "start_nid".  Whereas prior to this patch we unconditionally
called hstate_next_node_to_{alloc|free}(), whether or not we
successfully allocated/freed a huge page on the node, we now call
these functions only on failure to alloc/free, in order to advance to
the next allowed node.

Factor out the next_node_allowed() function to handle wrap at end
of node_online_map.  In this version, the allowed nodes include all 
of the online nodes.
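
To make the new calling convention concrete, here is a minimal,
illustrative sketch (mirroring the hunks below, not additional code) of
how a caller such as alloc_fresh_huge_page() uses the reworked helper:
the helper returns the node to try now and saves the next one, so the
caller re-calls it only when the attempt on the returned node fails:

	start_nid = hstate_next_node_to_alloc(h);	/* node to try first */
	next_nid = start_nid;
	do {
		page = alloc_fresh_huge_page_node(h, next_nid);
		if (page)
			break;		/* success: helper already advanced */
		/* failure: advance to the next allowed node and retry */
		next_nid = hstate_next_node_to_alloc(h);
	} while (next_nid != start_nid);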

Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   70 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 25 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:23:01.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:42:14.000000000 -0400
@@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
+ * common helper function for hstate_next_node_to_{alloc|free}.
+ * return next node in node_online_map, wrapping at end.
+ */
+static int next_node_allowed(int nid)
+{
+	nid = next_node(nid, node_online_map);
+	if (nid == MAX_NUMNODES)
+		nid = first_node(node_online_map);
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+
+	return nid;
+}
+
+/*
  * Use a helper variable to find the next node and then
  * copy it back to next_nid_to_alloc afterwards:
  * otherwise there's a window in which a racer might
@@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag
  */
 static int hstate_next_node_to_alloc(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_alloc;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_alloc = next_nid;
-	return next_nid;
+	return nid;
 }
 
 static int alloc_fresh_huge_page(struct hstate *h)
@@ -649,15 +663,17 @@ static int alloc_fresh_huge_page(struct
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_alloc;
+	start_nid = hstate_next_node_to_alloc(h);
 	next_nid = start_nid;
 
 	do {
 		page = alloc_fresh_huge_page_node(h, next_nid);
-		if (page)
+		if (page) {
 			ret = 1;
+			break;
+		}
 		next_nid = hstate_next_node_to_alloc(h);
-	} while (!page && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -668,17 +684,19 @@ static int alloc_fresh_huge_page(struct
 }
 
 /*
- * helper for free_pool_huge_page() - find next node
- * from which to free a huge page
+ * helper for free_pool_huge_page() - return the next node
+ * from which to free a huge page.  Advance the next node id
+ * whether or not we find a free huge page to free so that the
+ * next attempt to free addresses the next node.
  */
 static int hstate_next_node_to_free(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_free, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_free;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_free = next_nid;
-	return next_nid;
+	return nid;
 }
 
 /*
@@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_free;
+	start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
@@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
 			}
 			update_and_free_page(h, page);
 			ret = 1;
+			break;
 		}
 		next_nid = hstate_next_node_to_free(h);
-	} while (!ret && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	return ret;
 }
@@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(h->next_nid_to_alloc),
+				NODE_DATA(hstate_next_node_to_alloc(h)),
 				huge_page_size(h), huge_page_size(h), 0);
 
-		hstate_next_node_to_alloc(h);
 		if (addr) {
 			/*
 			 * Use the beginning of the huge page to store the
@@ -1167,29 +1185,31 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = h->next_nid_to_alloc;
+		start_nid = hstate_next_node_to_alloc(h);
 	else
-		start_nid = h->next_nid_to_free;
+		start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
 		int nid = next_nid;
 		if (delta < 0)  {
-			next_nid = hstate_next_node_to_alloc(h);
 			/*
 			 * To shrink on this node, there must be a surplus page
 			 */
-			if (!h->surplus_huge_pages_node[nid])
+			if (!h->surplus_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_alloc(h);
 				continue;
+			}
 		}
 		if (delta > 0) {
-			next_nid = hstate_next_node_to_free(h);
 			/*
 			 * Surplus cannot exceed the total number of pages
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
+						h->nr_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_free(h);
 				continue;
+			}
 		}
 
 		h->surplus_huge_pages += delta;


* [PATCH 2/11] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:44   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:44 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 2/11] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns

Against:  2.6.31-mmotm-090914-0157

V3: + moved this patch to after the "rework" of hstate_next_node_to_...
      functions as this patch is more specific to using task mempolicy
      to control huge page allocation and freeing.

V5: + removed now unneeded 'nextnid' from hstate_next_node_to_{alloc|free}
      and updated the stale comments.

V6: + move defaulting of nodes_allowed [to &node_online_map] up to
      set_max_huge_pages().  Eliminate from hstate_next_node_*()
      functions.  [David Rientjes' suggestion].
    + renamed "this_node_allowed()" to "get_valid_node_allowed()"
      [for David]

In preparation for constraining huge page allocation and freeing by the
controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
to the allocate, free and surplus adjustment functions.  For now, pass
NULL to indicate default behavior--i.e., use node_online_map.  A
subsequent patch will derive a non-default mask from the controlling
task's numa mempolicy.
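
As a rough sketch of where this is headed (the real version arrives in
a later patch of this series), a caller such as set_max_huge_pages()
will derive a mask, fall back to node_online_map when none is
available, and free any kmalloc()ed mask when it is done:

	nodemask_t *nodes_allowed = alloc_nodemask_of_mempolicy();

	if (!nodes_allowed)
		nodes_allowed = &node_online_map;	/* default behavior */

	ret = alloc_fresh_huge_page(h, nodes_allowed);
	...
	if (nodes_allowed != &node_online_map)
		kfree(nodes_allowed);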

Note that this method of updating the global hstate nr_hugepages under
the constraint of a nodemask simplifies keeping the global state 
consistent--especially the number of persistent and surplus pages
relative to reservations and overcommit limits.  There are undoubtedly
other ways to do this, but this works for both interfaces:  mempolicy
and per node attributes.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>

 mm/hugetlb.c |  120 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 71 insertions(+), 49 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:42:14.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:42:17.000000000 -0400
@@ -622,48 +622,56 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
- * common helper function for hstate_next_node_to_{alloc|free}.
- * return next node in node_online_map, wrapping at end.
+ * common helper functions for hstate_next_node_to_{alloc|free}.
+ * We may have allocated or freed a huge page based on a different
+ * nodes_allowed previously, so h->next_node_to_{alloc|free} might
+ * be outside of *nodes_allowed.  Ensure that we use an allowed
+ * node for alloc or free.
  */
-static int next_node_allowed(int nid)
+static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
 {
-	nid = next_node(nid, node_online_map);
+	nid = next_node(nid, *nodes_allowed);
 	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+		nid = first_node(*nodes_allowed);
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 
 	return nid;
 }
 
+static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	if (!node_isset(nid, *nodes_allowed))
+		nid = next_node_allowed(nid, nodes_allowed);
+	return nid;
+}
+
 /*
- * Use a helper variable to find the next node and then
- * copy it back to next_nid_to_alloc afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do.  Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
+ * returns the previously saved node ["this node"] from which to
+ * allocate a persistent huge page for the pool and advance the
+ * next node from which to allocate, handling wrap at end of node
+ * mask.
  */
-static int hstate_next_node_to_alloc(struct hstate *h)
+static int hstate_next_node_to_alloc(struct hstate *h,
+					nodemask_t *nodes_allowed)
 {
-	int nid, next_nid;
+	int nid;
+
+	VM_BUG_ON(!nodes_allowed);
+
+	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
+	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
 
-	nid = h->next_nid_to_alloc;
-	next_nid = next_node_allowed(nid);
-	h->next_nid_to_alloc = next_nid;
 	return nid;
 }
 
-static int alloc_fresh_huge_page(struct hstate *h)
+static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 {
 	struct page *page;
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hstate_next_node_to_alloc(h);
+	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -672,7 +680,7 @@ static int alloc_fresh_huge_page(struct
 			ret = 1;
 			break;
 		}
-		next_nid = hstate_next_node_to_alloc(h);
+		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	} while (next_nid != start_nid);
 
 	if (ret)
@@ -684,18 +692,20 @@ static int alloc_fresh_huge_page(struct
 }
 
 /*
- * helper for free_pool_huge_page() - return the next node
- * from which to free a huge page.  Advance the next node id
- * whether or not we find a free huge page to free so that the
- * next attempt to free addresses the next node.
+ * helper for free_pool_huge_page() - return the previously saved
+ * node ["this node"] from which to free a huge page.  Advance the
+ * next node id whether or not we find a free huge page to free so
+ * that the next attempt to free addresses the next node.
  */
-static int hstate_next_node_to_free(struct hstate *h)
+static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 {
-	int nid, next_nid;
+	int nid;
+
+	VM_BUG_ON(!nodes_allowed);
+
+	nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
+	h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
 
-	nid = h->next_nid_to_free;
-	next_nid = next_node_allowed(nid);
-	h->next_nid_to_free = next_nid;
 	return nid;
 }
 
@@ -705,13 +715,14 @@ static int hstate_next_node_to_free(stru
  * balanced over allowed nodes.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, bool acct_surplus)
+static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+							 bool acct_surplus)
 {
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hstate_next_node_to_free(h);
+	start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -735,7 +746,7 @@ static int free_pool_huge_page(struct hs
 			ret = 1;
 			break;
 		}
-		next_nid = hstate_next_node_to_free(h);
+		next_nid = hstate_next_node_to_free(h, nodes_allowed);
 	} while (next_nid != start_nid);
 
 	return ret;
@@ -937,7 +948,7 @@ static void return_unused_surplus_pages(
 	 * on-line nodes for us and will handle the hstate accounting.
 	 */
 	while (nr_pages--) {
-		if (!free_pool_huge_page(h, 1))
+		if (!free_pool_huge_page(h, &node_online_map, 1))
 			break;
 	}
 }
@@ -1047,7 +1058,7 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(hstate_next_node_to_alloc(h)),
+				NODE_DATA(hstate_next_node_to_alloc(h, NULL)),
 				huge_page_size(h), huge_page_size(h), 0);
 
 		if (addr) {
@@ -1102,7 +1113,7 @@ static void __init hugetlb_hstate_alloc_
 		if (h->order >= MAX_ORDER) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h))
+		} else if (!alloc_fresh_huge_page(h, &node_online_map))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -1144,16 +1155,22 @@ static void __init report_hugepages(void
 }
 
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(struct hstate *h, unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count,
+						nodemask_t *nodes_allowed)
 {
 	int i;
 
 	if (h->order >= MAX_ORDER)
 		return;
 
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
+		if (!node_isset(i, *nodes_allowed))
+			continue;
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
 				return;
@@ -1167,7 +1184,8 @@ static void try_to_free_low(struct hstat
 	}
 }
 #else
-static inline void try_to_free_low(struct hstate *h, unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count,
+						nodemask_t *nodes_allowed)
 {
 }
 #endif
@@ -1177,7 +1195,8 @@ static inline void try_to_free_low(struc
  * balanced by operating on them in a round-robin fashion.
  * Returns 1 if an adjustment was made.
  */
-static int adjust_pool_surplus(struct hstate *h, int delta)
+static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
+				int delta)
 {
 	int start_nid, next_nid;
 	int ret = 0;
@@ -1185,9 +1204,9 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = hstate_next_node_to_alloc(h);
+		start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	else
-		start_nid = hstate_next_node_to_free(h);
+		start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -1197,7 +1216,8 @@ static int adjust_pool_surplus(struct hs
 			 * To shrink on this node, there must be a surplus page
 			 */
 			if (!h->surplus_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_alloc(h);
+				next_nid = hstate_next_node_to_alloc(h,
+								nodes_allowed);
 				continue;
 			}
 		}
@@ -1207,7 +1227,8 @@ static int adjust_pool_surplus(struct hs
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
 						h->nr_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_free(h);
+				next_nid = hstate_next_node_to_free(h,
+								nodes_allowed);
 				continue;
 			}
 		}
@@ -1225,6 +1246,7 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
+	nodemask_t *nodes_allowed = &node_online_map;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
@@ -1242,7 +1264,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, -1))
+		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
 	}
 
@@ -1253,7 +1275,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h);
+		ret = alloc_fresh_huge_page(h, nodes_allowed);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1277,13 +1299,13 @@ static unsigned long set_max_huge_pages(
 	 */
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count);
+	try_to_free_low(h, min_count, nodes_allowed);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, 0))
+		if (!free_pool_huge_page(h, nodes_allowed, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, 1))
+		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
 	}
 out:


* [PATCH 3/11] hugetlb:  introduce alloc_nodemask_of_node
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:44   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:44 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 3/11] - hugetlb:  introduce alloc_nodemask_of_node()

Against:  2.6.31-mmotm-090914-0157

New in V5 of series

V6: + rename 'init_nodemask_of_nodes()' to 'init_nodemask_of_node()'
    + redefine init_nodemask_of_node() as static inline fcn
    + move this patch back 1 in series

Introduce a nodemask macro, alloc_nodemask_of_node(), to allocate a
nodemask and initialize it to contain a single node, using
init_nodemask_of_node(), which is factored out of the
nodemask_of_node() macro.

alloc_nodemask_of_node() is coded as a macro to avoid header
dependency hell.

This will be used to construct the huge pages "nodes_allowed"
nodemask for a single node when basing nodes_allowed on a
preferred/local mempolicy or when a persistent huge page
pool page count is modified via a per node sysfs attribute.
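
A minimal usage sketch (illustrative only, not part of this patch):
callers allocate a single-node mask, handle allocation failure, and
kfree() the mask when finished:

	nodemask_t *nodes_allowed = alloc_nodemask_of_node(nid);

	if (nodes_allowed) {
		/* constrain huge page alloc/free to 'nid' via *nodes_allowed */
		...
		kfree(nodes_allowed);		/* caller must free */
	} else {
		/* kmalloc() failed: fall back to default behavior */
	}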

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/nodemask.h |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/include/linux/nodemask.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/nodemask.h	2009-09-15 13:38:32.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/nodemask.h	2009-09-15 13:42:18.000000000 -0400
@@ -245,18 +245,36 @@ static inline int __next_node(int n, con
 	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
 }
 
+static inline void init_nodemask_of_node(nodemask_t *mask, int node)
+{
+	nodes_clear(*(mask));
+	node_set((node), *(mask));
+}
+
 #define nodemask_of_node(node)						\
 ({									\
 	typeof(_unused_nodemask_arg_) m;				\
 	if (sizeof(m) == sizeof(unsigned long)) {			\
 		m.bits[0] = 1UL<<(node);				\
 	} else {							\
-		nodes_clear(m);						\
-		node_set((node), m);					\
+		init_nodemask_of_node(&m, (node));			\
 	}								\
 	m;								\
 })
 
+/*
+ * returns pointer to kmalloc()'d nodemask initialized to contain the
+ * specified node.  Caller must free with kfree().
+ */
+#define alloc_nodemask_of_node(node)					\
+({									\
+	typeof(_unused_nodemask_arg_) *nmp;				\
+	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
+	if (nmp)							\
+		init_nodemask_of_node(nmp, (node));			\
+	nmp;								\
+})
+
 #define first_unset_node(mask) __first_unset_node(&(mask))
 static inline int __first_unset_node(const nodemask_t *maskp)
 {


* [PATCH 4/11] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:44   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:44 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 4/11] hugetlb:  derive huge pages nodes allowed from task mempolicy

Against:  2.6.31-mmotm-090914-0157

V2: + cleaned up comments, removed some deemed unnecessary,
      added some suggested by review
    + removed check for !current in huge_mpol_nodes_allowed().
    + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
    + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
      catch out of range node id.
    + added examples to patch description

V3: Factored this patch from V2 patch 2/3

V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()

V5: remove internal '\n' from printk in huge_mpol_nodes_allowed()

V6: + rename 'huge_mpol_nodes_allowed()" to "alloc_nodemask_of_mempolicy()"
    + move the printk() when we can't kmalloc() a nodemask_t to
      set_max_huge_pages(), as alloc_nodemask_of_mempolicy() is no longer
      hugepage specific.
    + handle movement of nodes_allowed initialization:
    ++ Don't kfree() nodes_allowed when it points at node_online_map.

V7: + drop mpol-get/put from alloc_nodemask_of_mempolicy().  Not needed
      here because the current task is examining its own mempolicy.  Add
      comment to that effect.
    + use init_nodemask_of_node() to initialize the nodes_allowed for
      single node policies [preferred/local].
      

This patch derives a "nodes_allowed" node mask from the NUMA
mempolicy of the task modifying the number of persistent huge
pages, and uses that mask to control the allocation and freeing of
persistent huge pages and the adjustment of surplus huge pages.
The mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced. 
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

Notes:

1) This patch introduces a subtle change in behavior:  huge page
   allocation and freeing will be constrained by any mempolicy
   that the task adjusting the huge page pool inherits from its
   parent.  This policy could come from a distant ancestor.  The
   administrator adjusting the huge page pool without explicitly
   specifying a mempolicy via numactl might be surprised by this.
   Additionally, any mempolicy specified by numactl will be
   constrained by the cpuset in which numactl is invoked.
   Using sysfs per node hugepages attributes to adjust the per
   node persistent huge pages count [subsequent patch] ignores
   mempolicy and cpuset constraints.

2) Hugepages allocated at boot time use the node_online_map.
   An additional patch could implement a temporary boot time
   huge pages nodes_allowed command line parameter.

3) Using mempolicy to control persistent huge page allocation
   and freeing requires no change to hugeadm when invoking
   it via numactl, as shown in the examples below.  However,
   hugeadm could be enhanced to take the allowed nodes as an
   argument and set its task mempolicy itself.  This would allow
   it to detect and warn about any non-default mempolicy that it
   inherited from its parent, thus alleviating the issue described
   in Note 1 above.

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	hugeadm --pool-pages-min=2048Kb:32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the 
condition above, with 8 huge pages per node:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    3 ++
 mm/hugetlb.c              |   12 +++++++++-
 mm/mempolicy.c            |   53 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 1 deletion(-)

Index: linux-2.6.31-mmotm-090914-0157/mm/mempolicy.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/mempolicy.c	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/mempolicy.c	2009-09-15 13:42:18.000000000 -0400
@@ -1564,6 +1564,59 @@ struct zonelist *huge_zonelist(struct vm
 	}
 	return zl;
 }
+
+/*
+ * alloc_nodemask_of_mempolicy
+ *
+ * Returns a [pointer to a] nodelist based on the current task's mempolicy.
+ *
+ * If the task's mempolicy is "default" [NULL], return NULL for default
+ * behavior.  Otherwise, extract the policy nodemask for 'bind'
+ * or 'interleave' policy or construct a nodemask for 'preferred' or
+ * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
+ *
+ * We don't bother with reference counting the mempolicy [mpol_get/put]
+ * because the current task is examining it's own mempolicy and a task's
+ * mempolicy is only ever changed by the task itself.
+ *
+ * N.B., it is the caller's responsibility to free a returned nodemask.
+ */
+nodemask_t *alloc_nodemask_of_mempolicy(void)
+{
+	nodemask_t *nodes_allowed = NULL;
+	struct mempolicy *mempolicy;
+	int nid;
+
+	if (!current->mempolicy)
+		return NULL;
+
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed)
+		goto out;		/* silently default */
+
+	mempolicy = current->mempolicy;
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			nid = numa_node_id();
+		else
+			nid = mempolicy->v.preferred_node;
+		init_nodemask_of_node(nodes_allowed, nid);
+		break;
+
+	case MPOL_BIND:
+		/* Fall through */
+	case MPOL_INTERLEAVE:
+		*nodes_allowed =  mempolicy->v.nodes;
+		break;
+
+	default:
+		BUG();
+	}
+
+out:
+	return nodes_allowed;
+}
 #endif
 
 /* Allocate a page in interleaved policy.
Index: linux-2.6.31-mmotm-090914-0157/include/linux/mempolicy.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/mempolicy.h	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/mempolicy.h	2009-09-15 13:42:18.000000000 -0400
@@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
+extern nodemask_t *alloc_nodemask_of_mempolicy(void);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
 	return node_zonelist(0, gfp_flags);
 }
 
+static inline nodemask_t *alloc_nodemask_of_mempolicy(void) { return NULL; }
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:42:17.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:42:18.000000000 -0400
@@ -1246,11 +1246,19 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
-	nodemask_t *nodes_allowed = &node_online_map;
+	nodemask_t *nodes_allowed;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
+	nodes_allowed = alloc_nodemask_of_mempolicy();
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.  Falling back to default.\n",
+			current->comm);
+		nodes_allowed = &node_online_map;
+	}
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1311,6 +1319,8 @@ static unsigned long set_max_huge_pages(
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	if (nodes_allowed != &node_online_map)
+		kfree(nodes_allowed);
 	return ret;
 }
 


* [PATCH 4/11] hugetlb:  derive huge pages nodes allowed from task mempolicy
@ 2009-09-15 20:44   ` Lee Schermerhorn
  0 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:44 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 4/11] hugetlb:  derive huge pages nodes allowed from task mempolicy

Against:  2.6.31-mmotm-090914-0157

V2: + cleaned up comments, removed some deemed unnecessary,
      add some suggested by review
    + removed check for !current in huge_mpol_nodes_allowed().
    + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
    + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
      catch out of range node id.
    + add examples to patch description

V3: Factored this patch from V2 patch 2/3

V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()

V5: remove internal '\n' from printk in huge_mpol_nodes_allowed()

V6: + rename 'huge_mpol_nodes_allowed()" to "alloc_nodemask_of_mempolicy()"
    + move the printk() when we can't kmalloc() a nodemask_t to
      set_max_huge_pages(), as alloc_nodemask_of_mempolicy() is no longer
      hugepage specific.
    + handle movement of nodes_allowed initialization:
    ++ Don't kfree() nodes_allowed when it points at node_online_map.

V7: + drop mpol-get/put from alloc_nodemask_of_mempolicy().  Not needed
      here because the current task is examining its own mempolicy.  Add
      comment to that effect.
    + use init_nodemask_of_node() to initialize the nodes_allowed for
      single node policies [preferred/local].
      

This patch derives a "nodes_allowed" node mask from the numa
mempolicy of the task modifying the number of persistent huge
pages to control the allocation, freeing and adjusting of surplus
huge pages.  This mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced. 
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

Notes:

1) This patch introduces a subtle change in behavior:  huge page
   allocation and freeing will be constrained by any mempolicy
   that the task adjusting the huge page pool inherits from its
   parent.  This policy could come from a distant ancestor.  The
   administrator adjusting the huge page pool without explicitly
   specifying a mempolicy via numactl might be surprised by this.
   Additionally, any mempolicy specified by numactl will be
   constrained by the cpuset in which numactl is invoked.
   Using sysfs per node hugepages attributes to adjust the per
   node persistent huge pages count [subsequent patch] ignores
   mempolicy and cpuset constraints.

2) Hugepages allocated at boot time use the node_online_map.
   An additional patch could implement a temporary boot time
   huge pages nodes_allowed command line parameter.

3) Using mempolicy to control persistent huge page allocation
   and freeing requires no change to hugeadm when invoking
   it via numactl, as shown in the examples below.  However,
   hugeadm could be enhanced to take the allowed nodes as an
   argument and set its task mempolicy itself.  This would allow
   it to detect and warn about any non-default mempolicy that it
   inherited from its parent, thus alleviating the issue described
   in Note 1 above.

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	hugeadm --pool-pages-min=2048Kb:32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the 
condition above, with 8 huge pages per node:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    3 ++
 mm/hugetlb.c              |   12 +++++++++-
 mm/mempolicy.c            |   53 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 1 deletion(-)

Index: linux-2.6.31-mmotm-090914-0157/mm/mempolicy.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/mempolicy.c	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/mempolicy.c	2009-09-15 13:42:18.000000000 -0400
@@ -1564,6 +1564,59 @@ struct zonelist *huge_zonelist(struct vm
 	}
 	return zl;
 }
+
+/*
+ * alloc_nodemask_of_mempolicy
+ *
+ * Returns a [pointer to a] nodelist based on the current task's mempolicy.
+ *
+ * If the task's mempolicy is "default" [NULL], return NULL for default
+ * behavior.  Otherwise, extract the policy nodemask for 'bind'
+ * or 'interleave' policy or construct a nodemask for 'preferred' or
+ * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
+ *
+ * We don't bother with reference counting the mempolicy [mpol_get/put]
+ * because the current task is examining it's own mempolicy and a task's
+ * mempolicy is only ever changed by the task itself.
+ *
+ * N.B., it is the caller's responsibility to free a returned nodemask.
+ */
+nodemask_t *alloc_nodemask_of_mempolicy(void)
+{
+	nodemask_t *nodes_allowed = NULL;
+	struct mempolicy *mempolicy;
+	int nid;
+
+	if (!current->mempolicy)
+		return NULL;
+
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed)
+		goto out;		/* silently default */
+
+	mempolicy = current->mempolicy;
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			nid = numa_node_id();
+		else
+			nid = mempolicy->v.preferred_node;
+		init_nodemask_of_node(nodes_allowed, nid);
+		break;
+
+	case MPOL_BIND:
+		/* Fall through */
+	case MPOL_INTERLEAVE:
+		*nodes_allowed =  mempolicy->v.nodes;
+		break;
+
+	default:
+		BUG();
+	}
+
+out:
+	return nodes_allowed;
+}
 #endif
 
 /* Allocate a page in interleaved policy.
Index: linux-2.6.31-mmotm-090914-0157/include/linux/mempolicy.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/mempolicy.h	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/mempolicy.h	2009-09-15 13:42:18.000000000 -0400
@@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
+extern nodemask_t *alloc_nodemask_of_mempolicy(void);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
 	return node_zonelist(0, gfp_flags);
 }
 
+static inline nodemask_t *alloc_nodemask_of_mempolicy(void) { return NULL; }
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:42:17.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:42:18.000000000 -0400
@@ -1246,11 +1246,19 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
-	nodemask_t *nodes_allowed = &node_online_map;
+	nodemask_t *nodes_allowed;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
+	nodes_allowed = alloc_nodemask_of_mempolicy();
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.  Falling back to default.\n",
+			current->comm);
+		nodes_allowed = &node_online_map;
+	}
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1311,6 +1319,8 @@ static unsigned long set_max_huge_pages(
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	if (nodes_allowed != &node_online_map)
+		kfree(nodes_allowed);
 	return ret;
 }
 

* [PATCH 5/11] hugetlb:  add generic definition of NUMA_NO_NODE
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:44   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:44 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 5/11] - hugetlb:  promote NUMA_NO_NODE to generic constant

Against:  2.6.31-mmotm-090914-0157

New in V7 of series

Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific
headers to generic header 'linux/numa.h' for use in generic code.
NUMA_NO_NODE replaces bare '-1' where it's used in this series to
indicate "no node id specified".  Ultimately, it can be used
to replace the -1 elsewhere where it is used similarly.
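
A minimal before/after sketch [hypothetical caller; the hugetlb patches
later in this series use the constant this way]:

	/* before: bare constant */
	int nid = -1;			/* no node id specified */

	/* with this patch: generic constant from <linux/numa.h> */
	int nid = NUMA_NO_NODE;		/* no node id specified */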

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>

 arch/ia64/include/asm/numa.h    |    2 --
 arch/x86/include/asm/topology.h |    5 ++---
 include/linux/numa.h            |    2 ++
 3 files changed, 4 insertions(+), 5 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/arch/ia64/include/asm/numa.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/arch/ia64/include/asm/numa.h	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/arch/ia64/include/asm/numa.h	2009-09-15 13:42:19.000000000 -0400
@@ -22,8 +22,6 @@
 
 #include <asm/mmzone.h>
 
-#define NUMA_NO_NODE	-1
-
 extern u16 cpu_to_node_map[NR_CPUS] __cacheline_aligned;
 extern cpumask_t node_to_cpu_mask[MAX_NUMNODES] __cacheline_aligned;
 extern pg_data_t *pgdat_list[MAX_NUMNODES];
Index: linux-2.6.31-mmotm-090914-0157/arch/x86/include/asm/topology.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/arch/x86/include/asm/topology.h	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/arch/x86/include/asm/topology.h	2009-09-15 13:42:19.000000000 -0400
@@ -35,11 +35,10 @@
 # endif
 #endif
 
-/* Node not present */
-#define NUMA_NO_NODE	(-1)
-
 #ifdef CONFIG_NUMA
 #include <linux/cpumask.h>
+#include <linux/numa.h>
+
 #include <asm/mpspec.h>
 
 #ifdef CONFIG_X86_32
Index: linux-2.6.31-mmotm-090914-0157/include/linux/numa.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/numa.h	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/numa.h	2009-09-15 13:42:19.000000000 -0400
@@ -10,4 +10,6 @@
 
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
+#define	NUMA_NO_NODE	(-1)
+
 #endif /* _LINUX_NUMA_H */

* [PATCH 6/11] hugetlb:  add per node hstate attributes
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2009-09-15 20:44   ` Lee Schermerhorn
@ 2009-09-15 20:44 ` Lee Schermerhorn
  2009-09-15 20:45 ` [PATCH 7/11] hugetlb: update hugetlb documentation for mempolicy based management Lee Schermerhorn
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:44 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 6/11] hugetlb:  register per node hugepages attributes

Against:  2.6.31-mmotm-090914-0157

V2:  remove dependency on kobject private bitfield.  Search
     global hstates then all per node hstates for kobject
     match in attribute show/store functions.

V3:  rebase atop the mempolicy-based hugepage alloc/free;
     use custom "nodes_allowed" to restrict alloc/free to
     a specific node via per node attributes.  Per node
     attribute overrides mempolicy.  I.e., mempolicy only
     applies to global attributes.

V5:  Fix issues raised by Mel Gorman:
     + add !NUMA versions of hugetlb_[un]register_node()
     + rename 'hi' to 'i' in kobj_to_node_hstate()
     + rename (count, input) to (len, count) in nr_hugepages_store()
     + moved per node hugepages_kobj and hstate_kobjs[] from the
       struct node [sysdev] to hugetlb.c private arrays.
     + changed registration mechanism so that hugetlbfs [a module]
       registers its attribute registration callbacks with the node
       driver, eliminating the dependency between the node driver
       and hugetlbfs.  From its init func, hugetlbfs will register
       all on-line nodes' hugepage sysfs attributes along with
       hugetlbfs' attributes register/unregister functions.  The
       node driver will use these functions to [un]register nodes
       with hugetlbfs on node hot-plug.
     + replaced hugetlb.c private "nodes_allowed_from_node()" with
       [new] generic "alloc_nodemask_of_node()".

V5a: + fix !NUMA register_hugetlbfs_with_node():  don't use
       keyword 'do' as parameter name!

V6:  + Use NUMA_NO_NODE for unspecified node id throughout hugetlb.c
       to indicate that we didn't get there via a per node attribute.
       Drop redundant "NO_NODEID_SPECIFIED" definition.
     + handle movement of defaulting of nodes_allowed up to
       set_max_huge_pages()

V7:  + add ifdefs + stubs to eliminate unneeded hugetlb registration
       functions when HUGETLBFS is not configured.
     + add some comments to per node hstate registration code in
       hugetlb.c

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_huge_pages    - r/o
	surplus_huge_pages - r/o

The patch attempts to re-use/share as much of the existing
global hstate attribute initialization and handling, and the
"nodes_allowed" constraint processing as possible.
Calling set_max_huge_pages() with no node indicates a change to
global hstate parameters.  In this case, any non-default task
mempolicy will be used to generate the nodes_allowed mask.  A
valid node id indicates an update to that node's hstate 
parameters, and the count argument specifies the target count
for the specified node.  From this info, we compute the target
global count for the hstate and construct a nodes_allowed node
mask containing only the specified node.
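
For example [hypothetical counts]:  if the hstate currently contains 24
huge pages, 8 of which reside on node 2, then writing 16 to node 2's
nr_hugepages attribute yields a global target of

	count = 16 + (24 - 8) = 32

i.e., the pool grows by 8 huge pages, all of them restricted to node 2
by the single-node nodes_allowed mask.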

Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
	numactl -m 2 hugeadm --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>

 drivers/base/node.c  |   39 +++++++
 include/linux/node.h |   11 +
 mm/hugetlb.c         |  281 +++++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 301 insertions(+), 30 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/drivers/base/node.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/drivers/base/node.c	2009-09-15 13:22:51.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/drivers/base/node.c	2009-09-15 13:42:22.000000000 -0400
@@ -177,6 +177,43 @@ static ssize_t node_read_distance(struct
 }
 static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+#ifdef CONFIG_HUGETLBFS
+/*
+ * hugetlbfs per node attributes registration interface:
+ * When/if hugetlb[fs] subsystem initializes [sometime after this module],
+ * it will register its per node attributes for all nodes online at that
+ * time.  It will also call register_hugetlbfs_with_node(), below, to
+ * register its attribute registration functions with this node driver.
+ * Once these hooks have been initialized, the node driver will call into
+ * the hugetlb module to [un]register attributes for hot-plugged nodes.
+ */
+static node_registration_func_t __hugetlb_register_node;
+static node_registration_func_t __hugetlb_unregister_node;
+
+static inline void hugetlb_register_node(struct node *node)
+{
+	if (__hugetlb_register_node)
+		__hugetlb_register_node(node);
+}
+
+static inline void hugetlb_unregister_node(struct node *node)
+{
+	if (__hugetlb_unregister_node)
+		__hugetlb_unregister_node(node);
+}
+
+void register_hugetlbfs_with_node(node_registration_func_t doregister,
+				  node_registration_func_t unregister)
+{
+	__hugetlb_register_node   = doregister;
+	__hugetlb_unregister_node = unregister;
+}
+#else
+static inline void hugetlb_register_node(struct node *node) {}
+
+static inline void hugetlb_unregister_node(struct node *node) {}
+#endif
+
 
 /*
  * register_node - Setup a sysfs device for a node.
@@ -200,6 +237,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +258,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:42:18.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:43:13.000000000 -0400
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/node.h>
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1243,7 +1244,8 @@ static int adjust_pool_surplus(struct hs
 }
 
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+								int nid)
 {
 	unsigned long min_count, ret;
 	nodemask_t *nodes_allowed;
@@ -1251,7 +1253,17 @@ static unsigned long set_max_huge_pages(
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	nodes_allowed = alloc_nodemask_of_mempolicy();
+	if (nid == NUMA_NO_NODE) {
+		nodes_allowed = alloc_nodemask_of_mempolicy();
+	} else {
+		/*
+		 * incoming 'count' is for node 'nid' only, so
+		 * adjust count to global, but restrict alloc/free
+		 * to the specified node.
+		 */
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		nodes_allowed = alloc_nodemask_of_node(nid);
+	}
 	if (!nodes_allowed) {
 		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
 			"for huge page allocation.  Falling back to default.\n",
@@ -1334,51 +1346,71 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
-static struct hstate *kobj_to_hstate(struct kobject *kobj)
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
 {
 	int i;
+
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
+		if (hstate_kobjs[i] == kobj) {
+			if (nidp)
+				*nidp = NUMA_NO_NODE;
 			return &hstates[i];
-	BUG();
-	return NULL;
+		}
+
+	return kobj_to_node_hstate(kobj, nidp);
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	struct hstate *h;
+	unsigned long nr_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NUMA_NO_NODE)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
-		struct kobj_attribute *attr, const char *buf, size_t count)
+		struct kobj_attribute *attr, const char *buf, size_t len)
 {
+	unsigned long count;
+	struct hstate *h;
+	int nid;
 	int err;
-	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
 
-	err = strict_strtoul(buf, 10, &input);
+	err = strict_strtoul(buf, 10, &count);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h = kobj_to_hstate(kobj, &nid);
+	h->max_huge_pages = set_max_huge_pages(h, count, nid);
 
-	return count;
+	return len;
 }
 HSTATE_ATTR(nr_hugepages);
 
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
 	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
@@ -1395,15 +1427,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	struct hstate *h;
+	unsigned long free_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NUMA_NO_NODE)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
 static ssize_t resv_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 	return sprintf(buf, "%lu\n", h->resv_huge_pages);
 }
 HSTATE_ATTR_RO(resv_hugepages);
@@ -1411,8 +1452,17 @@ HSTATE_ATTR_RO(resv_hugepages);
 static ssize_t surplus_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	struct hstate *h;
+	unsigned long surplus_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NUMA_NO_NODE)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1429,19 +1479,21 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1456,17 +1508,184 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+
+/*
+ * node_hstate/s - associate per node hstate attributes, via their kobjects,
+ * with node sysdevs in node_devices[] using a parallel array.  The array
+ * index of a node sysdev or _hstate == node id.
+ * This is here to avoid any static dependency of the node sysdev driver, in
+ * the base kernel, on the hugetlb module.
+ */
+struct node_hstate {
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
+};
+struct node_hstate node_hstates[MAX_NUMNODES];
+
+/*
+ * A subset of global hstate attributes for node sysdevs
+ */
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+/*
+ * kobj_to_node_hstate - lookup global hstate for node sysdev hstate attr kobj.
+ * Returns node id via non-NULL nidp.
+ */
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node_hstate *nhs = &node_hstates[nid];
+		int i;
+		for (i = 0; i < HUGE_MAX_HSTATE; i++)
+			if (nhs->hstate_kobjs[i] == kobj) {
+				if (nidp)
+					*nidp = nid;
+				return &hstates[i];
+			}
+	}
+
+	BUG();
+	return NULL;
+}
+
+/*
+ * Unregister hstate attributes from a single node sysdev.
+ * No-op if no hstate attributes attached.
+ */
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
+
+	if (!nhs->hugepages_kobj)
+		return;
+
+	for_each_hstate(h)
+		if (nhs->hstate_kobjs[h - hstates]) {
+			kobject_put(nhs->hstate_kobjs[h - hstates]);
+			nhs->hstate_kobjs[h - hstates] = NULL;
+		}
+
+	kobject_put(nhs->hugepages_kobj);
+	nhs->hugepages_kobj = NULL;
+}
+
+/*
+ * hugetlb module exit:  unregister hstate attributes from node sysdevs
+ * that have them.
+ */
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	/*
+	 * disable node sysdev registrations.
+	 */
+	register_hugetlbfs_with_node(NULL, NULL);
+
+	/*
+	 * remove hstate attributes from any nodes that have them.
+	 */
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+}
+
+/*
+ * Register hstate attributes for a single node sysdev.
+ * No-op if attributes already registered.
+ */
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
+	int err;
+
+	if (nhs->hugepages_kobj)
+		return;		/* already allocated */
+
+	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!nhs->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
+						nhs->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err) {
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+			hugetlb_unregister_node(node);
+			break;
+		}
+	}
+}
+
+/*
+ * hugetlb init time:  register hstate attributes for all registered
+ * node sysdevs.  All on-line nodes should have registered their
+ * associated sysdev by the time the hugetlb module initializes.
+ */
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid)
+			hugetlb_register_node(node);
+	}
+
+	/*
+	 * Let the node sysdev driver know we're here so it can
+	 * [un]register hstate attributes on node hotplug.
+	 */
+	register_hugetlbfs_with_node(hugetlb_register_node,
+				     hugetlb_unregister_node);
+}
+#else	/* !CONFIG_NUMA */
+
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	BUG();
+	if (nidp)
+		*nidp = -1;
+	return NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void) { }
+
+static void hugetlb_register_all_nodes(void) { }
+
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1501,6 +1720,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1603,7 +1824,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp, NUMA_NO_NODE);
 
 	return 0;
 }
Index: linux-2.6.31-mmotm-090914-0157/include/linux/node.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/node.h	2009-09-15 13:19:02.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/node.h	2009-09-15 13:42:22.000000000 -0400
@@ -28,6 +28,7 @@ struct node {
 
 struct memory_block;
 extern struct node node_devices[];
+typedef  void (*node_registration_func_t)(struct node *);
 
 extern int register_node(struct node *, int, struct node *);
 extern void unregister_node(struct node *node);
@@ -39,6 +40,11 @@ extern int unregister_cpu_under_node(uns
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						int nid);
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+
+#ifdef CONFIG_HUGETLBFS
+extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
+					 node_registration_func_t unregister);
+#endif
 #else
 static inline int register_one_node(int nid)
 {
@@ -65,6 +71,11 @@ static inline int unregister_mem_sect_un
 {
 	return 0;
 }
+
+static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
+						node_registration_func_t unreg)
+{
+}
 #endif
 
 #define to_node(sys_device) container_of(sys_device, struct node, sysdev)

* [PATCH 7/11] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
                   ` (5 preceding siblings ...)
  2009-09-15 20:44 ` [PATCH 6/11] hugetlb: add per node hstate attributes Lee Schermerhorn
@ 2009-09-15 20:45 ` Lee Schermerhorn
  2009-09-16 13:37   ` Mel Gorman
  2009-09-15 20:45   ` Lee Schermerhorn
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:45 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 7/11] hugetlb:  update hugetlb documentation for mempolicy based management.

Against:  2.6.31-mmotm-090914-0157

V2:  Add brief description of per node attributes.

V6:  address review comments

This patch updates the kernel huge tlb documentation to describe the
numa memory policy based huge page management.  Additionally, the patch
includes a fair amount of rework to improve consistency, eliminate
duplication and set the context for documenting the memory policy
interaction.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>

 Documentation/vm/hugetlbpage.txt |  263 +++++++++++++++++++++++++--------------
 1 file changed, 175 insertions(+), 88 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:22:53.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:32.000000000 -0400
@@ -11,23 +11,21 @@ This optimization is more critical now a
 (several GBs) are more readily available.
 
 Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSv shared memory system calls (shmget, shmat).
+system call or standard SYSV shared memory system calls (shmget, shmat).
 
 First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with huge page support should show the number of configured
-huge pages in the system by running the "cat /proc/meminfo" command.
+The /proc/meminfo file provides information about the total number of
+persistent hugetlb pages in the kernel's huge page pool.  It also displays
+information about the number of free, reserved and surplus huge pages and the
+default huge page size.  The huge page size is needed for generating the
+proper alignment and size of the arguments to system calls that map huge page
+regions.
 
-/proc/meminfo also provides information about the total number of hugetlb
-pages configured in the kernel.  It also displays information about the
-number of free hugetlb pages at any time.  It also displays information about
-the configured huge page size - this is needed for generating the proper
-alignment and size of the arguments to the above system calls.
-
-The output of "cat /proc/meminfo" will have lines like:
+The output of "cat /proc/meminfo" will include lines like:
 
 .....
 HugePages_Total: vvv
@@ -53,59 +51,63 @@ HugePages_Surp  is short for "surplus,"
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) huge pages.
-The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of huge pages is
-possible only if there are enough hugetlb pages free that can be transferred
-back to regular memory pool).
-
-Pages that are used as hugetlb pages are reserved inside the kernel and cannot
-be used for other purposes.
-
-Once the kernel with Hugetlb page support is built and running, a user can
-use either the mmap system call or shared memory system calls to start using
-the huge pages.  It is required that the system administrator preallocate
-enough memory for huge page purposes.
-
-The administrator can preallocate huge pages on the kernel boot command line by
-specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
-requested.  This is the most reliable method for preallocating huge pages as
-memory has not yet become fragmented.
+/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
+pages in the kernel's huge page pool.  "Persistent" huge pages will be
+returned to the huge page pool when freed by a task.  A user with root
+privileges can dynamically allocate more or free some persistent huge pages
+by increasing or decreasing the value of 'nr_hugepages'.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes.  Huge pages cannot be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages.  See the discussion of
+Using Huge Pages, below.
+
+The administrator can allocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested.  This is the most reliable method of
+allocating huge pages as memory has not yet become fragmented.
 
-Some platforms support multiple huge page sizes.  To preallocate huge pages
+Some platforms support multiple huge page sizes.  To allocate huge pages
 of a specific size, one must preceed the huge pages boot command parameters
 with a huge page size selection parameter "hugepagesz=<size>".  <size> must
 be specified in bytes with optional scale suffix [kKmMgG].  The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured [default
-size] hugetlb pages in the kernel.  Super user can dynamically request more
-(or free some pre-configured) huge pages.
-
-Use the following command to dynamically allocate/deallocate default sized
-huge pages:
+When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 default sized huge pages in the system.
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
 On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over the all on-line nodes.  These huge pages, allocated when nr_hugepages
-is increased, are called "persistent huge pages".
+over the set of allowed nodes specified by the NUMA memory policy of the
+task that modifies nr_hugepages.  The default for the allowed nodes--when the
+task has default memory policy--is all on-line nodes.  Allowed nodes with
+insufficient available, contiguous memory for a huge page will be silently
+skipped when allocating persistent huge pages.  See the discussion below of
+the interaction of task memory policy, cpusets and per node attributes with
+the allocation and freeing of persistent huge pages.
 
 The success or failure of huge page allocation depends on the amount of
-physically contiguous memory that is preset in system at the time of the
+physically contiguous memory that is present in system at the time of the
 allocation attempt.  If the kernel is unable to allocate huge pages from
 some nodes in a NUMA system, it will attempt to make up the difference by
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any.
 
-System administrators may want to put this command in one of the local rc init
-files.  This will enable the kernel to request huge pages early in the boot
-process when the possibility of getting physical contiguous pages is still
-very high.  Administrators can verify the number of huge pages actually
-allocated by checking the sysctl or meminfo.  To check the per node
+System administrators may want to put this command in one of the local rc
+init files.  This will enable the kernel to allocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high.  Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo.  To check the per node
 distribution of huge pages in a NUMA system, use:
 
 	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
@@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
 /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
 huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
 requested by applications.  Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
-huge pages from the buddy allocator, when the normal pool is exhausted. As
-these surplus huge pages go out of use, they are freed back to the buddy
-allocator.
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
 
-When increasing the huge page pool size via nr_hugepages, any surplus
+When increasing the huge page pool size via nr_hugepages, any existing surplus
 pages will first be promoted to persistent huge pages.  Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
-the new huge page pool size.
+the new persistent huge page pool size.
 
-The administrator may shrink the pool of preallocated huge pages for
+The administrator may shrink the pool of persistent huge pages for
 the default huge page size by setting the nr_hugepages sysctl to a
 smaller value.  The kernel will attempt to balance the freeing of huge pages
-across all on-line nodes.  Any free huge pages on the selected nodes will
-be freed back to the buddy allocator.
-
-Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of huge pages in use will convert the balance to surplus
-huge pages even if it would exceed the overcommit value.  As long as
-this condition holds, however, no more surplus huge pages will be
-allowed on the system until one of the two sysctls are increased
-sufficiently, or the surplus huge pages go out of use and are freed.
+across all nodes in the memory policy of the task modifying nr_hugepages.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages.  This will occur even if
+the number of surplus pages would exceed the overcommit value.  As long as
+this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
 
 With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface has been duplicated in sysfs. The above
-information applies to the default huge page size which will be
-controlled by the /proc interfaces for backwards compatibility. The root
-huge page control directory in sysfs is:
+the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
+The /proc interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is:
 
 	/sys/kernel/mm/hugepages
 
 For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form
+will exist, of the form:
 
 	hugepages-${size}kB
 
@@ -159,6 +162,98 @@ Inside each of these directories, the sa
 
 which function as described above for the default huge page-sized case.
 
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
+
+Whether huge pages are allocated and freed via the /proc interface or
+the /sysfs interface, the NUMA nodes from which huge pages are allocated
+or freed are controlled by the NUMA memory policy of the task that modifies
+the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the nr_hugepages example above, is:
+
+    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+or, more succinctly:
+
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+This will allocate or free abs(20 - nr_hugepages) to or from the nodes
+specified in <node-list>, depending on whether nr_hugepages is initially
+less than or greater than 20, respectively.  No huge pages will be
+allocated nor freed on any node not included in the specified <node-list>.
+
+Any memory policy mode--bind, preferred, local or interleave--may be
+used.  The effect on persistent huge page allocation is as follows:
+
+1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+   persistent huge pages will be distributed across the node or nodes
+   specified in the mempolicy as if "interleave" had been specified.
+   However, if a node in the policy does not contain sufficient contiguous
+   memory for a huge page, the allocation will not "fallback" to the nearest
+   neighbor node with sufficient contiguous memory.  To do this would cause
+   undesirable imbalance in the distribution of the huge page pool, or
+   possibly, allocation of persistent huge pages on nodes not allowed by
+   the task's memory policy.
+
+2) One or more nodes may be specified with the bind or interleave policy.
+   If more than one node is specified with the preferred policy, only the
+   lowest numeric id will be used.  Local policy will select the node where
+   the task is running at the time the nodes_allowed mask is constructed.
+
+3) For local policy to be deterministic, the task must be bound to a cpu or
+   cpus in a single node.  Otherwise, the task could be migrated to some
+   other node at any time after launch and the resulting node will be
+   indeterminate.  Thus, local policy is not very useful for this purpose.
+   Any of the other mempolicy modes may be used to specify a single node.
+
+4) The nodes allowed mask will be derived from any non-default task mempolicy,
+   whether this policy was set explicitly by the task itself or one of its
+   ancestors, such as numactl.  This means that if the task is invoked from a
+   shell with non-default policy, that policy will be used.  One can specify a
+   node list of "all" with numactl --interleave or --membind [-m] to achieve
+   interleaving over all nodes in the system or cpuset.
+
+5) Any task mempolicy specified--e.g., using numactl--will be constrained by
+   the resource limits of any cpuset in which the task runs.  Thus, there will
+   be no way for a task with non-default policy running in a cpuset with a
+   subset of the system nodes to allocate huge pages outside the cpuset
+   without first moving to a cpuset that contains all of the desired nodes.
+
+6) Boot-time huge page allocation attempts to distribute the requested number
+   of huge pages over all on-line nodes.
+
+Per Node Hugepages Attributes
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, has been replicated under each "node" system device in:
+
+	/sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files:
+
+	nr_hugepages
+	free_hugepages
+	surplus_hugepages
+
+The 'free_' and 'surplus_' attribute files are read-only.  They return the number
+of free and surplus [overcommitted] huge pages, respectively, on the parent
+node.
+
+The nr_hugepages attribute will return the total number of huge pages on the
+specified node.  When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
+
+Note that the number of overcommit and reserve pages remain global quantities,
+as we don't know until fault time, when the faulting task's mempolicy is applied,
+from which node the huge page allocation will be attempted.
+
+
+Using Huge Pages:
+
 If the user applications are going to request huge pages using mmap system
 call, then it is required that system administrator mount a file system of
 type hugetlbfs:
@@ -206,9 +301,11 @@ map_hugetlb.c.
  * requesting huge pages.
  *
  * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages.  That means the addresses starting with 0x800000... will need
- * to be specified.  Specifying a fixed address is not required on ppc64,
- * i386 or x86_64.
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
  *
  * Note: The default shared memory limit is quite low on many kernels,
  * you may need to increase it via:
@@ -237,14 +334,8 @@ map_hugetlb.c.
 
 #define dprintf(x)  printf(x)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define SHMAT_FLAGS (0)
-#endif
 
 int main(void)
 {
@@ -302,10 +393,12 @@ int main(void)
  * example, the app is requesting memory of size 256MB that is backed by
  * huge pages.
  *
- * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
- * That means the addresses starting with 0x800000... will need to be
- * specified.  Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
  */
 #include <stdlib.h>
 #include <stdio.h>
@@ -317,14 +410,8 @@ int main(void)
 #define LENGTH (256UL*1024*1024)
 #define PROTECTION (PROT_READ | PROT_WRITE)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define FLAGS (MAP_SHARED)
-#endif
 
 void check_bytes(char *addr)
 {

* [PATCH 8/11] hugetlb:  Optionally use mempolicy for persistent huge page allocation
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:45   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:45 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 8/11] hugetlb:  Optionally use mempolicy for persistent huge page allocation

From: Mel Gorman <mel@csn.ul.ie>

Against:  2.6.31-mmotm-090914-0157

Patch "derive huge pages nodes allowed from task mempolicy" brought
huge page support more in line with the core VM in that tuning the size
of the static huge page pool would obey memory policies. Using this,
administrators could interleave allocation of huge pages from a subset
of nodes. This is consistent with how dynamic hugepage pool resizing
works and how hugepages get allocated to applications at run-time.

However, it was pointed out that scripts may exist that depend on being
able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
that are running within a memory policy. This patch adds
/proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
memory policies. /proc/sys/vm/nr_hugepages continues then to be a
system-wide tunable regardless of memory policy.

Replicate the vm/nr_hugepages_mempolicy sysctl under the sysfs global
hstate attributes directory.

Note:  with this patch, hugeadm will require an update to write to the
vm/nr_hugepages_mempolicy sysctl/attribute when one wants to adjust
the hugepage pool on a specific set of nodes.
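
For illustration only -- the node ids, page count and interleave policy
below are made-up examples -- a task can pin the adjustment to its own
mempolicy roughly as follows; writing the same value to nr_hugepages
instead would ignore that policy (build with -lnuma for set_mempolicy()):

	/* sketch: grow the persistent pool under an interleave policy */
	#include <numaif.h>	/* set_mempolicy(), MPOL_INTERLEAVE */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned long nodemask = 0x3;	/* example: nodes 0 and 1 */
		int fd;

		if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
				  8 * sizeof(nodemask)))
			perror("set_mempolicy");

		/* obeys the interleave policy established above */
		fd = open("/proc/sys/vm/nr_hugepages_mempolicy", O_WRONLY);
		if (fd >= 0) {
			write(fd, "20", 2);
			close(fd);
		}
		return 0;
	}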

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>


 Documentation/vm/hugetlbpage.txt |   36 ++++++++-------
 include/linux/hugetlb.h          |    6 ++
 kernel/sysctl.c                  |   12 +++++
 mm/hugetlb.c                     |   91 ++++++++++++++++++++++++++++++++-------
 4 files changed, 114 insertions(+), 31 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/hugetlb.h	2009-09-15 13:23:01.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h	2009-09-15 13:48:11.000000000 -0400
@@ -23,6 +23,12 @@ void reset_vma_resv_huge_pages(struct vm
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
+
+#ifdef CONFIG_NUMA
+int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
+				void __user *, size_t *, loff_t *);
+#endif
+
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			struct page **, struct vm_area_struct **,
Index: linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/kernel/sysctl.c	2009-09-15 13:23:01.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c	2009-09-15 13:43:36.000000000 -0400
@@ -1170,6 +1170,18 @@ static struct ctl_table vm_table[] = {
 		.extra1		= (void *)&hugetlb_zero,
 		.extra2		= (void *)&hugetlb_infinity,
 	 },
+#ifdef CONFIG_NUMA
+	 {
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "nr_hugepages_mempolicy",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
+		.extra1		= (void *)&hugetlb_zero,
+		.extra2		= (void *)&hugetlb_infinity,
+	 },
+#endif
 	 {
 		.ctl_name	= VM_HUGETLB_GROUP,
 		.procname	= "hugetlb_shm_group",
Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:43:13.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:50:28.000000000 -0400
@@ -1243,6 +1243,7 @@ static int adjust_pool_surplus(struct hs
 	return ret;
 }
 
+#define HUGETLB_NO_NODE_OBEY_MEMPOLICY (NUMA_NO_NODE - 1)
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 								int nid)
@@ -1253,9 +1254,14 @@ static unsigned long set_max_huge_pages(
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	if (nid == NUMA_NO_NODE) {
+	switch (nid) {
+	case HUGETLB_NO_NODE_OBEY_MEMPOLICY:
 		nodes_allowed = alloc_nodemask_of_mempolicy();
-	} else {
+		break;
+	case NUMA_NO_NODE:
+		nodes_allowed = &node_online_map;
+		break;
+	default:
 		/*
 		 * incoming 'count' is for node 'nid' only, so
 		 * adjust count to global, but restrict alloc/free
@@ -1354,23 +1360,24 @@ static struct hstate *kobj_to_hstate(str
 
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
 		if (hstate_kobjs[i] == kobj) {
-			if (nidp)
-				*nidp = NUMA_NO_NODE;
+			/*
+			 * let *nidp default.
+			 */
 			return &hstates[i];
 		}
 
 	return kobj_to_node_hstate(kobj, nidp);
 }
 
-static ssize_t nr_hugepages_show(struct kobject *kobj,
+static ssize_t nr_hugepages_show_common(int nid_default, struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
 	struct hstate *h;
 	unsigned long nr_huge_pages;
-	int nid;
+	int nid = nid_default;
 
 	h = kobj_to_hstate(kobj, &nid);
-	if (nid == NUMA_NO_NODE)
+	if (nid < 0)
 		nr_huge_pages = h->nr_huge_pages;
 	else
 		nr_huge_pages = h->nr_huge_pages_node[nid];
@@ -1378,12 +1385,12 @@ static ssize_t nr_hugepages_show(struct
 	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
 
-static ssize_t nr_hugepages_store(struct kobject *kobj,
+static ssize_t nr_hugepages_store_common(int nid_default, struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t len)
 {
 	unsigned long count;
 	struct hstate *h;
-	int nid;
+	int nid = nid_default;
 	int err;
 
 	err = strict_strtoul(buf, 10, &count);
@@ -1395,8 +1402,42 @@ static ssize_t nr_hugepages_store(struct
 
 	return len;
 }
+
+static ssize_t nr_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	return nr_hugepages_show_common(NUMA_NO_NODE, kobj, attr, buf);
+}
+
+static ssize_t nr_hugepages_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	return nr_hugepages_store_common(NUMA_NO_NODE, kobj, attr, buf, len);
+}
 HSTATE_ATTR(nr_hugepages);
 
+#ifdef CONFIG_NUMA
+
+/*
+ * hstate attribute for optionally mempolicy-based constraint on persistent
+ * huge page alloc/free.
+ */
+static ssize_t nr_hugepages_mempolicy_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	return nr_hugepages_show_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
+						kobj, attr, buf);
+}
+
+static ssize_t nr_hugepages_mempolicy_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	return nr_hugepages_store_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
+					kobj, attr, buf, len);
+}
+HSTATE_ATTR(nr_hugepages_mempolicy);
+#endif
+
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
@@ -1429,7 +1470,7 @@ static ssize_t free_hugepages_show(struc
 {
 	struct hstate *h;
 	unsigned long free_huge_pages;
-	int nid;
+	int nid = NUMA_NO_NODE;
 
 	h = kobj_to_hstate(kobj, &nid);
 	if (nid == NUMA_NO_NODE)
@@ -1454,7 +1495,7 @@ static ssize_t surplus_hugepages_show(st
 {
 	struct hstate *h;
 	unsigned long surplus_huge_pages;
-	int nid;
+	int nid = NUMA_NO_NODE;
 
 	h = kobj_to_hstate(kobj, &nid);
 	if (nid == NUMA_NO_NODE)
@@ -1472,6 +1513,9 @@ static struct attribute *hstate_attrs[]
 	&free_hugepages_attr.attr,
 	&resv_hugepages_attr.attr,
 	&surplus_hugepages_attr.attr,
+#ifdef CONFIG_NUMA
+	&nr_hugepages_mempolicy_attr.attr,
+#endif
 	NULL,
 };
 
@@ -1809,9 +1853,9 @@ static unsigned int cpuset_mems_nr(unsig
 }
 
 #ifdef CONFIG_SYSCTL
-int hugetlb_sysctl_handler(struct ctl_table *table, int write,
-			   void __user *buffer,
-			   size_t *length, loff_t *ppos)
+static int hugetlb_sysctl_handler_common(int no_node,
+			 struct ctl_table *table, int write,
+			 void __user *buffer, size_t *length, loff_t *ppos)
 {
 	struct hstate *h = &default_hstate;
 	unsigned long tmp;
@@ -1824,7 +1868,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp, NUMA_NO_NODE);
+		h->max_huge_pages = set_max_huge_pages(h, tmp, no_node);
 
 	return 0;
 }
@@ -1864,6 +1908,23 @@ int hugetlb_overcommit_handler(struct ct
 	return 0;
 }
 
+int hugetlb_sysctl_handler(struct ctl_table *table, int write,
+			   void __user *buffer, size_t *length, loff_t *ppos)
+{
+
+	return hugetlb_sysctl_handler_common(NUMA_NO_NODE,
+				table, write, buffer, length, ppos);
+}
+
+#ifdef CONFIG_NUMA
+int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
+			   void __user *buffer, size_t *length, loff_t *ppos)
+{
+	return hugetlb_sysctl_handler_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
+				table, write, buffer, length, ppos);
+}
+#endif /* CONFIG_NUMA */
+
 #endif /* CONFIG_SYSCTL */
 
 void hugetlb_report_meminfo(struct seq_file *m)
Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:32.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:36.000000000 -0400
@@ -155,6 +155,7 @@ will exist, of the form:
 Inside each of these directories, the same set of files will exist:
 
 	nr_hugepages
+	nr_hugepages_mempolicy
 	nr_overcommit_hugepages
 	free_hugepages
 	resv_hugepages
@@ -166,26 +167,30 @@ which function as described above for th
 Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
 
 Whether huge pages are allocated and freed via the /proc interface or
-the /sysfs interface, the NUMA nodes from which huge pages are allocated
-or freed are controlled by the NUMA memory policy of the task that modifies
-the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
+the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
+nodes from which huge pages are allocated or freed are controlled by the
+NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
+sysctl or attribute.  When the nr_hugepages attribute is used, mempolicy
+is ignored.
 
 The recommended method to allocate or free huge pages to/from the kernel
 huge page pool, using the nr_hugepages example above, is:
 
-    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+    numactl --interleave <node-list> echo 20 \
+				>/proc/sys/vm/nr_hugepages_mempolicy
 
 or, more succinctly:
 
-    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
 
 This will allocate or free abs(20 - nr_hugepages) to or from the nodes
-specified in <node-list>, depending on whether nr_hugepages is initially
-less than or greater than 20, respectively.  No huge pages will be
+specified in <node-list>, depending on whether the number of persistent huge pages
+is initially less than or greater than 20, respectively.  No huge pages will be
 allocated nor freed on any node not included in the specified <node-list>.
 
-Any memory policy mode--bind, preferred, local or interleave--may be
-used.  The effect on persistent huge page allocation is as follows:
+When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
+memory policy mode--bind, preferred, local or interleave--may be used.  The
+resulting effect on persistent huge page allocation is as follows:
 
 1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
    persistent huge pages will be distributed across the node or nodes
@@ -201,27 +206,26 @@ used.  The effect on persistent huge pag
    If more than one node is specified with the preferred policy, only the
    lowest numeric id will be used.  Local policy will select the node where
    the task is running at the time the nodes_allowed mask is constructed.
-
-3) For local policy to be deterministic, the task must be bound to a cpu or
+   For local policy to be deterministic, the task must be bound to a cpu or
    cpus in a single node.  Otherwise, the task could be migrated to some
    other node at any time after launch and the resulting node will be
    indeterminate.  Thus, local policy is not very useful for this purpose.
    Any of the other mempolicy modes may be used to specify a single node.
 
-4) The nodes allowed mask will be derived from any non-default task mempolicy,
+3) The nodes allowed mask will be derived from any non-default task mempolicy,
    whether this policy was set explicitly by the task itself or one of its
    ancestors, such as numactl.  This means that if the task is invoked from a
    shell with non-default policy, that policy will be used.  One can specify a
    node list of "all" with numactl --interleave or --membind [-m] to achieve
    interleaving over all nodes in the system or cpuset.
 
-5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
+4) Any task mempolicy specified--e.g., using numactl--will be constrained by
    the resource limits of any cpuset in which the task runs.  Thus, there will
    be no way for a task with non-default policy running in a cpuset with a
    subset of the system nodes to allocate huge pages outside the cpuset
    without first moving to a cpuset that contains all of the desired nodes.
 
-6) Boot-time huge page allocation attempts to distribute the requested number
+5) Boot-time huge page allocation attempts to distribute the requested number
    of huge pages over all on-lines nodes.
 
 Per Node Hugepages Attributes
@@ -248,8 +252,8 @@ pages on the parent node will be adjuste
 resources exist, regardless of the task's mempolicy or cpuset constraints.
 
 Note that the number of overcommit and reserve pages remain global quantities,
-as we don't know until fault time, when the faulting task's mempolicy is applied,
-from which node the huge page allocation will be attempted.
+as we don't know until fault time, when the faulting task's mempolicy is
+applied, from which node the huge page allocation will be attempted.
 
 
 Using Huge Pages:


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 9/11] hugetlb:  use only nodes with memory for huge pages
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:45   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:45 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 9/11] hugetlb:  use only nodes with memory

Against:  2.6.31-mmotm-090914-0157

Register per node hstate sysfs attributes only for nodes with
memory.  Suggested by David Rientjes.
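
The distinction matters because a node can be online yet have no memory.
A minimal sketch of the test involved -- register_attrs_for() is a
hypothetical placeholder, not a function added by this series:

	int nid;

	for_each_online_node(nid) {
		if (!node_state(nid, N_HIGH_MEMORY))
			continue;		/* online but memoryless */
		register_attrs_for(nid);	/* hypothetical helper */
	}

The patch below expresses the same thing more directly by iterating
node_states[N_HIGH_MEMORY] with for_each_node_state().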

A subsequent patch will handle adding/removing of per node hstate
sysfs attributes when nodes transition to/from memoryless state
via memory hotplug.

NOTE:  this patch has not been tested with memoryless nodes.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/hugetlbpage.txt |   12 ++++++------
 mm/hugetlb.c                     |   39 ++++++++++++++++++++-------------------
 2 files changed, 26 insertions(+), 25 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:50:28.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:52:04.000000000 -0400
@@ -942,14 +942,14 @@ static void return_unused_surplus_pages(
 
 	/*
 	 * We want to release as many surplus pages as possible, spread
-	 * evenly across all nodes. Iterate across all nodes until we
-	 * can no longer free unreserved surplus pages. This occurs when
-	 * the nodes with surplus pages have no free pages.
-	 * free_pool_huge_page() will balance the the frees across the
-	 * on-line nodes for us and will handle the hstate accounting.
+	 * evenly across all nodes with memory. Iterate across these nodes
+	 * until we can no longer free unreserved surplus pages. This occurs
+	 * when the nodes with surplus pages have no free pages.
+	 * free_pool_huge_page() will balance the freed pages across the
+	 * on-line nodes with memory and will handle the hstate accounting.
 	 */
 	while (nr_pages--) {
-		if (!free_pool_huge_page(h, &node_online_map, 1))
+		if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
 			break;
 	}
 }
@@ -1053,7 +1053,7 @@ static struct page *alloc_huge_page(stru
 int __weak alloc_bootmem_huge_page(struct hstate *h)
 {
 	struct huge_bootmem_page *m;
-	int nr_nodes = nodes_weight(node_online_map);
+	int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
 
 	while (nr_nodes) {
 		void *addr;
@@ -1114,7 +1114,8 @@ static void __init hugetlb_hstate_alloc_
 		if (h->order >= MAX_ORDER) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h, &node_online_map))
+		} else if (!alloc_fresh_huge_page(h,
+					 &node_states[N_HIGH_MEMORY]))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -1165,7 +1166,7 @@ static void try_to_free_low(struct hstat
 		return;
 
 	if (!nodes_allowed)
-		nodes_allowed = &node_online_map;
+		nodes_allowed = &node_states[N_HIGH_MEMORY];
 
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
@@ -1259,7 +1260,7 @@ static unsigned long set_max_huge_pages(
 		nodes_allowed = alloc_nodemask_of_mempolicy();
 		break;
 	case NUMA_NO_NODE:
-		nodes_allowed = &node_online_map;
+		nodes_allowed = &node_states[N_HIGH_MEMORY];
 		break;
 	default:
 		/*
@@ -1274,7 +1275,7 @@ static unsigned long set_max_huge_pages(
 		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
 			"for huge page allocation.  Falling back to default.\n",
 			current->comm);
-		nodes_allowed = &node_online_map;
+		nodes_allowed = &node_states[N_HIGH_MEMORY];
 	}
 
 	/*
@@ -1337,7 +1338,7 @@ static unsigned long set_max_huge_pages(
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
-	if (nodes_allowed != &node_online_map)
+	if (nodes_allowed != &node_states[N_HIGH_MEMORY])
 		kfree(nodes_allowed);
 	return ret;
 }
@@ -1622,7 +1623,7 @@ void hugetlb_unregister_node(struct node
 	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
 
 	if (!nhs->hugepages_kobj)
-		return;
+		return;		/* no hstate attributes */
 
 	for_each_hstate(h)
 		if (nhs->hstate_kobjs[h - hstates]) {
@@ -1687,15 +1688,15 @@ void hugetlb_register_node(struct node *
 }
 
 /*
- * hugetlb init time:  register hstate attributes for all registered
- * node sysdevs.  All on-line nodes should have registered their
- * associated sysdev by the time the hugetlb module initializes.
+ * hugetlb init time:  register hstate attributes for all registered node
+ * sysdevs of nodes that have memory.  All on-line nodes should have
+ * registered their associated sysdev by this time.
  */
 static void hugetlb_register_all_nodes(void)
 {
 	int nid;
 
-	for (nid = 0; nid < nr_node_ids; nid++) {
+	for_each_node_state(nid, N_HIGH_MEMORY) {
 		struct node *node = &node_devices[nid];
 		if (node->sysdev.id == nid)
 			hugetlb_register_node(node);
@@ -1789,8 +1790,8 @@ void __init hugetlb_add_hstate(unsigned
 	h->free_huge_pages = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
-	h->next_nid_to_alloc = first_node(node_online_map);
-	h->next_nid_to_free = first_node(node_online_map);
+	h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
+	h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
 
Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:36.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:52:04.000000000 -0400
@@ -90,11 +90,11 @@ huge page pool to 20, allocating or free
 On a NUMA platform, the kernel will attempt to distribute the huge page pool
 over all the set of allowed nodes specified by the NUMA memory policy of the
 task that modifies nr_hugepages.  The default for the allowed nodes--when the
-task has default memory policy--is all on-line nodes.  Allowed nodes with
-insufficient available, contiguous memory for a huge page will be silently
-skipped when allocating persistent huge pages.  See the discussion below of
-the interaction of task memory policy, cpusets and per node attributes with
-the allocation and freeing of persistent huge pages.
+task has default memory policy--is all on-line nodes with memory.  Allowed
+nodes with insufficient available, contiguous memory for a huge page will be
+silently skipped when allocating persistent huge pages.  See the discussion
+below of the interaction of task memory policy, cpusets and per node attributes
+with the allocation and freeing of persistent huge pages.
 
 The success or failure of huge page allocation depends on the amount of
 physically contiguous memory that is present in system at the time of the
@@ -226,7 +226,7 @@ resulting effect on persistent huge page
    without first moving to a cpuset that contains all of the desired nodes.
 
 5) Boot-time huge page allocation attempts to distribute the requested number
-   of huge pages over all on-lines nodes.
+   of huge pages over all on-line nodes with memory.
 
 Per Node Hugepages Attributes
 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 10/11] hugetlb:  handle memory hot-plug events
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:45   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:45 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 10/11] hugetlb:  per node attributes -- handle memory hot plug

Against:  2.6.31-mmotm-090914-0157

Register per node hstate attributes only for nodes with memory.

With Memory Hotplug, memory can be added to a memoryless node and
a node with memory can become memoryless.  Therefore, add a memory
on/off-line notifier callback to [un]register a node's attributes
on transition to/from memoryless state.
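
The callback can stay cheap because it relies on the hotplug notifier's
status_change_nid field holding a valid node id only when that node is
transitioning to or from memoryless state; otherwise it is NUMA_NO_NODE.
A minimal sketch of the dispatch (the helpers are the ones added by the
patch below):

	struct memory_notify *mnb = arg;
	int nid = mnb->status_change_nid;

	if (nid == NUMA_NO_NODE)
		return NOTIFY_OK;	/* node itself did not change state */

	if (action == MEM_ONLINE)
		hugetlb_register_node(&node_devices[nid]);
	else if (action == MEM_OFFLINE)
		hugetlb_unregister_node(&node_devices[nid]);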

N.B.,  Only tested build, boot, libhugetlbfs regression.
       i.e., no memory hotplug testing.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/hugetlbpage.txt |    3 +-
 drivers/base/node.c              |   53 +++++++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 6 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/drivers/base/node.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/drivers/base/node.c	2009-09-15 13:42:22.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/drivers/base/node.c	2009-09-15 13:52:07.000000000 -0400
@@ -181,8 +181,8 @@ static SYSDEV_ATTR(distance, S_IRUGO, no
 /*
  * hugetlbfs per node attributes registration interface:
  * When/if hugetlb[fs] subsystem initializes [sometime after this module],
- * it will register its per node attributes for all nodes online at that
- * time.  It will also call register_hugetlbfs_with_node(), below, to
+ * it will register its per node attributes for all online nodes with
+ * memory.  It will also call register_hugetlbfs_with_node(), below, to
  * register its attribute registration functions with this node driver.
  * Once these hooks have been initialized, the node driver will call into
  * the hugetlb module to [un]register attributes for hot-plugged nodes.
@@ -192,7 +192,8 @@ static node_registration_func_t __hugetl
 
 static inline void hugetlb_register_node(struct node *node)
 {
-	if (__hugetlb_register_node)
+	if (__hugetlb_register_node &&
+			node_state(node->sysdev.id, N_HIGH_MEMORY))
 		__hugetlb_register_node(node);
 }
 
@@ -237,6 +238,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+
 		hugetlb_register_node(node);
 	}
 	return error;
@@ -258,7 +260,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
-	hugetlb_unregister_node(node);
+	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
 
 	sysdev_unregister(&node->sysdev);
 }
@@ -388,8 +390,45 @@ static int link_mem_sections(int nid)
 	}
 	return err;
 }
+
+/*
+ * Handle per node hstate attribute [un]registration on transitions
+ * to/from memoryless state.
+ */
+
+static int node_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	struct memory_notify *mnb = arg;
+	int nid = mnb->status_change_nid;
+
+	switch (action) {
+	case MEM_ONLINE:    /* memory successfully brought online */
+		if (nid != NUMA_NO_NODE)
+			hugetlb_register_node(&node_devices[nid]);
+		break;
+	case MEM_OFFLINE:   /* or offline */
+		if (nid != NUMA_NO_NODE)
+			hugetlb_unregister_node(&node_devices[nid]);
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_GOING_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
 #else
 static int link_mem_sections(int nid) { return 0; }
+
+static inline int node_memory_callback(struct notifier_block *self,
+				unsigned long action, void *arg)
+{
+	return NOTIFY_OK;
+}
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 int register_one_node(int nid)
@@ -503,13 +542,17 @@ static int node_states_init(void)
 	return err;
 }
 
+#define NODE_CALLBACK_PRI	2	/* lower than SLAB */
 static int __init register_node_type(void)
 {
 	int ret;
 
 	ret = sysdev_class_register(&node_class);
-	if (!ret)
+	if (!ret) {
 		ret = node_states_init();
+		hotplug_memory_notifier(node_memory_callback,
+					NODE_CALLBACK_PRI);
+	}
 
 	/*
 	 * Note:  we're not going to unregister the node class if we fail
Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:52:04.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:52:07.000000000 -0400
@@ -231,7 +231,8 @@ resulting effect on persistent huge page
 Per Node Hugepages Attributes
 
 A subset of the contents of the root huge page control directory in sysfs,
-described above, has been replicated under each "node" system device in:
+described above, will be replicated under the system device of each
+NUMA node with memory in:
 
 	/sys/devices/system/node/node[0-9]*/hugepages/
 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 11/11] hugetlb:  offload per node attribute registrations
  2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
@ 2009-09-15 20:45   ` Lee Schermerhorn
  2009-09-15 20:44   ` Lee Schermerhorn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-15 20:45 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 11/11] hugetlb:  offload [un]registration of sysfs attr to worker thread

Against:  2.6.31-mmotm-090914-0157

New in V6

V7:  + remove redundant check for memory{ful|less} node from 
       node_hugetlb_work().  Rely on [added] return from
       hugetlb_register_node() to differentiate between transitions
       to/from memoryless state.

This patch offloads the registration and unregistration of per node
hstate sysfs attributes to a worker thread rather than attempting the
allocation/attachment or detachment/freeing of the attributes in 
the context of the memory hotplug handler.
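
The pattern, sketched here with the names the patch introduces, is to do
nothing in the notifier beyond queueing work; the sysfs kobject allocation
and teardown then happen later in process context:

	/* at node registration time */
	INIT_WORK(&node_devices[nid].node_work, node_hugetlb_work);

	/* from the memory hotplug callback, instead of acting directly */
	schedule_work(&node_devices[nid].node_work);

node_hugetlb_work() then registers or unregisters the per node hstate
attributes according to whether the node currently has memory.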

N.B.,  Only tested build, boot, libhugetlbfs regression.
       i.e., no memory hotplug testing.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c  |   51 ++++++++++++++++++++++++++++++++++++++++++---------
 include/linux/node.h |    5 +++++
 2 files changed, 47 insertions(+), 9 deletions(-)

Index: linux-2.6.31-mmotm-090914-0157/include/linux/node.h
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/include/linux/node.h	2009-09-15 13:42:22.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/include/linux/node.h	2009-09-15 13:52:09.000000000 -0400
@@ -21,9 +21,14 @@
 
 #include <linux/sysdev.h>
 #include <linux/cpumask.h>
+#include <linux/workqueue.h>
 
 struct node {
 	struct sys_device	sysdev;
+
+#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
+	struct work_struct	node_work;
+#endif
 };
 
 struct memory_block;
Index: linux-2.6.31-mmotm-090914-0157/drivers/base/node.c
===================================================================
--- linux-2.6.31-mmotm-090914-0157.orig/drivers/base/node.c	2009-09-15 13:52:07.000000000 -0400
+++ linux-2.6.31-mmotm-090914-0157/drivers/base/node.c	2009-09-15 13:52:09.000000000 -0400
@@ -190,11 +190,14 @@ static SYSDEV_ATTR(distance, S_IRUGO, no
 static node_registration_func_t __hugetlb_register_node;
 static node_registration_func_t __hugetlb_unregister_node;
 
-static inline void hugetlb_register_node(struct node *node)
+static inline bool hugetlb_register_node(struct node *node)
 {
 	if (__hugetlb_register_node &&
-			node_state(node->sysdev.id, N_HIGH_MEMORY))
+			node_state(node->sysdev.id, N_HIGH_MEMORY)) {
 		__hugetlb_register_node(node);
+		return true;
+	}
+	return false;
 }
 
 static inline void hugetlb_unregister_node(struct node *node)
@@ -391,10 +394,31 @@ static int link_mem_sections(int nid)
 	return err;
 }
 
+#ifdef CONFIG_HUGETLBFS
 /*
  * Handle per node hstate attribute [un]registration on transistions
  * to/from memoryless state.
  */
+static void node_hugetlb_work(struct work_struct *work)
+{
+	struct node *node = container_of(work, struct node, node_work);
+
+	/*
+	 * We only get here when a node transitions to/from memoryless state.
+	 * We can detect which transition occurred by examining whether the
+	 * node has memory now.  hugetlb_register_node() already checks this,
+	 * so we try to register the attributes.  If that fails, then the
+	 * node has transitioned to memoryless, so try to unregister the
+	 * attributes.
+	 */
+	if (!hugetlb_register_node(node))
+		hugetlb_unregister_node(node);
+}
+
+static void init_node_hugetlb_work(int nid)
+{
+	INIT_WORK(&node_devices[nid].node_work, node_hugetlb_work);
+}
 
 static int node_memory_callback(struct notifier_block *self,
 				unsigned long action, void *arg)
@@ -403,14 +427,16 @@ static int node_memory_callback(struct n
 	int nid = mnb->status_change_nid;
 
 	switch (action) {
-	case MEM_ONLINE:    /* memory successfully brought online */
+	case MEM_ONLINE:
+	case MEM_OFFLINE:
+		/*
+		 * offload per node hstate [un]registration to a work thread
+		 * when transitioning to/from memoryless state.
+		 */
 		if (nid != NUMA_NO_NODE)
-			hugetlb_register_node(&node_devices[nid]);
-		break;
-	case MEM_OFFLINE:   /* or offline */
-		if (nid != NUMA_NO_NODE)
-			hugetlb_unregister_node(&node_devices[nid]);
+			schedule_work(&node_devices[nid].node_work);
 		break;
+
 	case MEM_GOING_ONLINE:
 	case MEM_GOING_OFFLINE:
 	case MEM_CANCEL_ONLINE:
@@ -421,7 +447,8 @@ static int node_memory_callback(struct n
 
 	return NOTIFY_OK;
 }
-#else
+#endif	/* CONFIG_HUGETLBFS */
+#else	/* !CONFIG_MEMORY_HOTPLUG_SPARSE */
 static int link_mem_sections(int nid) { return 0; }
 
 static inline int node_memory_callback(struct notifier_block *self,
@@ -429,6 +456,9 @@ static inline int node_memory_callback(s
 {
 	return NOTIFY_OK;
 }
+
+static void init_node_hugetlb_work(int nid) { }
+
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 int register_one_node(int nid)
@@ -453,6 +483,9 @@ int register_one_node(int nid)
 
 		/* link memory sections under this node */
 		error = link_mem_sections(nid);
+
+		/* initialize work queue for memory hot plug */
+		init_node_hugetlb_work(nid);
 	}
 
 	return error;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/11] hugetlb:  update hugetlb documentation for mempolicy based management.
  2009-09-15 20:45 ` [PATCH 7/11] hugetlb: update hugetlb documentation for mempolicy based management Lee Schermerhorn
@ 2009-09-16 13:37   ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2009-09-16 13:37 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Sep 15, 2009 at 04:45:04PM -0400, Lee Schermerhorn wrote:
> [PATCH 7/11] hugetlb:  update hugetlb documentation for mempolicy based management.
> 
> Against:  2.6.31-mmotm-090914-0157
> 
> V2:  Add brief description of per node attributes.
> 
> V6:  address review comments
> 
> This patch updates the kernel huge tlb documentation to describe the
> numa memory policy based huge page management.  Additionally, the patch
> includes a fair amount of rework to improve consistency, eliminate
> duplication and set the context for documenting the memory policy
> interaction.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Acked-by: David Rientjes <rientjes@google.com>
> 

Acked-by: Mel Gorman <mel@csn.ul.ie>

>  Documentation/vm/hugetlbpage.txt |  263 +++++++++++++++++++++++++--------------
>  1 file changed, 175 insertions(+), 88 deletions(-)
> 
> Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:22:53.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:32.000000000 -0400
> @@ -11,23 +11,21 @@ This optimization is more critical now a
>  (several GBs) are more readily available.
>  
>  Users can use the huge page support in Linux kernel by either using the mmap
> -system call or standard SYSv shared memory system calls (shmget, shmat).
> +system call or standard SYSV shared memory system calls (shmget, shmat).
>  
>  First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
>  (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
>  automatically when CONFIG_HUGETLBFS is selected) configuration
>  options.
>  
> -The kernel built with huge page support should show the number of configured
> -huge pages in the system by running the "cat /proc/meminfo" command.
> +The /proc/meminfo file provides information about the total number of
> +persistent hugetlb pages in the kernel's huge page pool.  It also displays
> +information about the number of free, reserved and surplus huge pages and the
> +default huge page size.  The huge page size is needed for generating the
> +proper alignment and size of the arguments to system calls that map huge page
> +regions.
>  
> -/proc/meminfo also provides information about the total number of hugetlb
> -pages configured in the kernel.  It also displays information about the
> -number of free hugetlb pages at any time.  It also displays information about
> -the configured huge page size - this is needed for generating the proper
> -alignment and size of the arguments to the above system calls.
> -
> -The output of "cat /proc/meminfo" will have lines like:
> +The output of "cat /proc/meminfo" will include lines like:
>  
>  .....
>  HugePages_Total: vvv
> @@ -53,59 +51,63 @@ HugePages_Surp  is short for "surplus,"
>  /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
>  in the kernel.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> -pages in the kernel.  Super user can dynamically request more (or free some
> -pre-configured) huge pages.
> -The allocation (or deallocation) of hugetlb pages is possible only if there are
> -enough physically contiguous free pages in system (freeing of huge pages is
> -possible only if there are enough hugetlb pages free that can be transferred
> -back to regular memory pool).
> -
> -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> -be used for other purposes.
> -
> -Once the kernel with Hugetlb page support is built and running, a user can
> -use either the mmap system call or shared memory system calls to start using
> -the huge pages.  It is required that the system administrator preallocate
> -enough memory for huge page purposes.
> -
> -The administrator can preallocate huge pages on the kernel boot command line by
> -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> -requested.  This is the most reliable method for preallocating huge pages as
> -memory has not yet become fragmented.
> +/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
> +pages in the kernel's huge page pool.  "Persistent" huge pages will be
> +returned to the huge page pool when freed by a task.  A user with root
> +privileges can dynamically allocate more or free some persistent huge pages
> +by increasing or decreasing the value of 'nr_hugepages'.
> +
> +Pages that are used as huge pages are reserved inside the kernel and cannot
> +be used for other purposes.  Huge pages cannot be swapped out under
> +memory pressure.
> +
> +Once a number of huge pages have been pre-allocated to the kernel huge page
> +pool, a user with appropriate privilege can use either the mmap system call
> +or shared memory system calls to use the huge pages.  See the discussion of
> +Using Huge Pages, below.
> +
> +The administrator can allocate persistent huge pages on the kernel boot
> +command line by specifying the "hugepages=N" parameter, where 'N' = the
> +number of huge pages requested.  This is the most reliable method of
> +allocating huge pages as memory has not yet become fragmented.
>  
> -Some platforms support multiple huge page sizes.  To preallocate huge pages
> +Some platforms support multiple huge page sizes.  To allocate huge pages
>  of a specific size, one must preceed the huge pages boot command parameters
>  with a huge page size selection parameter "hugepagesz=<size>".  <size> must
>  be specified in bytes with optional scale suffix [kKmMgG].  The default huge
>  page size may be selected with the "default_hugepagesz=<size>" boot parameter.
>  
> -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> -size] hugetlb pages in the kernel.  Super user can dynamically request more
> -(or free some pre-configured) huge pages.
> -
> -Use the following command to dynamically allocate/deallocate default sized
> -huge pages:
> +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> +indicates the current number of pre-allocated huge pages of the default size.
> +Thus, one can use the following command to dynamically allocate/deallocate
> +default sized persistent huge pages:
>  
>  	echo 20 > /proc/sys/vm/nr_hugepages
>  
> -This command will try to configure 20 default sized huge pages in the system.
> +This command will try to adjust the number of default sized huge pages in the
> +huge page pool to 20, allocating or freeing huge pages, as required.
> +
>  On a NUMA platform, the kernel will attempt to distribute the huge page pool
> -over the all on-line nodes.  These huge pages, allocated when nr_hugepages
> -is increased, are called "persistent huge pages".
> +over the set of allowed nodes specified by the NUMA memory policy of the
> +task that modifies nr_hugepages.  The default for the allowed nodes--when the
> +task has default memory policy--is all on-line nodes.  Allowed nodes with
> +insufficient available, contiguous memory for a huge page will be silently
> +skipped when allocating persistent huge pages.  See the discussion below of
> +the interaction of task memory policy, cpusets and per node attributes with
> +the allocation and freeing of persistent huge pages.
>  
>  The success or failure of huge page allocation depends on the amount of
> -physically contiguous memory that is preset in system at the time of the
> +physically contiguous memory that is present in the system at the time of the
>  allocation attempt.  If the kernel is unable to allocate huge pages from
>  some nodes in a NUMA system, it will attempt to make up the difference by
>  allocating extra pages on other nodes with sufficient available contiguous
>  memory, if any.
>  
> -System administrators may want to put this command in one of the local rc init
> -files.  This will enable the kernel to request huge pages early in the boot
> -process when the possibility of getting physical contiguous pages is still
> -very high.  Administrators can verify the number of huge pages actually
> -allocated by checking the sysctl or meminfo.  To check the per node
> +System administrators may want to put this command in one of the local rc
> +init files.  This will enable the kernel to allocate huge pages early in
> +the boot process when the possibility of getting physical contiguous pages
> +is still very high.  Administrators can verify the number of huge pages
> +actually allocated by checking the sysctl or meminfo.  To check the per node
>  distribution of huge pages in a NUMA system, use:
>  
>  	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
> @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
>  /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
>  huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
>  requested by applications.  Writing any non-zero value into this file
> -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> -huge pages from the buddy allocator, when the normal pool is exhausted. As
> -these surplus huge pages go out of use, they are freed back to the buddy
> -allocator.
> +indicates that the hugetlb subsystem is allowed to try to obtain that
> +number of "surplus" huge pages from the kernel's normal page pool, when the
> +persistent huge page pool is exhausted. As these surplus huge pages become
> +unused, they are freed back to the kernel's normal page pool.
>  
> -When increasing the huge page pool size via nr_hugepages, any surplus
> +When increasing the huge page pool size via nr_hugepages, any existing surplus
>  pages will first be promoted to persistent huge pages.  Then, additional
>  huge pages will be allocated, if necessary and if possible, to fulfill
> -the new huge page pool size.
> +the new persistent huge page pool size.
>  
> -The administrator may shrink the pool of preallocated huge pages for
> +The administrator may shrink the pool of persistent huge pages for
>  the default huge page size by setting the nr_hugepages sysctl to a
>  smaller value.  The kernel will attempt to balance the freeing of huge pages
> -across all on-line nodes.  Any free huge pages on the selected nodes will
> -be freed back to the buddy allocator.
> -
> -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> -than the number of huge pages in use will convert the balance to surplus
> -huge pages even if it would exceed the overcommit value.  As long as
> -this condition holds, however, no more surplus huge pages will be
> -allowed on the system until one of the two sysctls are increased
> -sufficiently, or the surplus huge pages go out of use and are freed.
> +across all nodes in the memory policy of the task modifying nr_hugepages.
> +Any free huge pages on the selected nodes will be freed back to the kernel's
> +normal page pool.
> +
> +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> +it becomes less than the number of huge pages in use will convert the balance
> +of the in-use huge pages to surplus huge pages.  This will occur even if
> +the number of surplus pages would exceed the overcommit value.  As long as
> +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> +increased sufficiently, or the surplus huge pages go out of use and are freed--
> +no more surplus huge pages will be allowed to be allocated.
>  
>  With support for multiple huge page pools at run-time available, much of
> -the huge page userspace interface has been duplicated in sysfs. The above
> -information applies to the default huge page size which will be
> -controlled by the /proc interfaces for backwards compatibility. The root
> -huge page control directory in sysfs is:
> +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> +The /proc interfaces discussed above have been retained for backwards
> +compatibility. The root huge page control directory in sysfs is:
>  
>  	/sys/kernel/mm/hugepages
>  
>  For each huge page size supported by the running kernel, a subdirectory
> -will exist, of the form
> +will exist, of the form:
>  
>  	hugepages-${size}kB
>  
> @@ -159,6 +162,98 @@ Inside each of these directories, the sa
>  
>  which function as described above for the default huge page-sized case.
>  
> +
> +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> +
> +Whether huge pages are allocated and freed via the /proc interface or
> +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> +or freed are controlled by the NUMA memory policy of the task that modifies
> +the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> +
> +The recommended method to allocate or free huge pages to/from the kernel
> +huge page pool, using the nr_hugepages example above, is:
> +
> +    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> +
> +or, more succinctly:
> +
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> +
> +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> +specified in <node-list>, depending on whether nr_hugepages is initially
> +less than or greater than 20, respectively.  No huge pages will be
> +allocated nor freed on any node not included in the specified <node-list>.
> +
> +Any memory policy mode--bind, preferred, local or interleave--may be
> +used.  The effect on persistent huge page allocation is as follows:
> +
> +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> +   persistent huge pages will be distributed across the node or nodes
> +   specified in the mempolicy as if "interleave" had been specified.
> +   However, if a node in the policy does not contain sufficient contiguous
> +   memory for a huge page, the allocation will not "fallback" to the nearest
> +   neighbor node with sufficient contiguous memory.  To do this would cause
> +   undesirable imbalance in the distribution of the huge page pool, or
> +   possibly, allocation of persistent huge pages on nodes not allowed by
> +   the task's memory policy.
> +
> +2) One or more nodes may be specified with the bind or interleave policy.
> +   If more than one node is specified with the preferred policy, only the
> +   lowest numeric id will be used.  Local policy will select the node where
> +   the task is running at the time the nodes_allowed mask is constructed.
> +
> +3) For local policy to be deterministic, the task must be bound to a cpu or
> +   cpus in a single node.  Otherwise, the task could be migrated to some
> +   other node at any time after launch and the resulting node will be
> +   indeterminate.  Thus, local policy is not very useful for this purpose.
> +   Any of the other mempolicy modes may be used to specify a single node.
> +
> +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +   whether this policy was set explicitly by the task itself or one of its
> +   ancestors, such as numactl.  This means that if the task is invoked from a
> +   shell with non-default policy, that policy will be used.  One can specify a
> +   node list of "all" with numactl --interleave or --membind [-m] to achieve
> +   interleaving over all nodes in the system or cpuset.
> +
> +5) Any task mempolicy specified--e.g., using numactl--will be constrained by
> +   the resource limits of any cpuset in which the task runs.  Thus, there will
> +   be no way for a task with non-default policy running in a cpuset with a
> +   subset of the system nodes to allocate huge pages outside the cpuset
> +   without first moving to a cpuset that contains all of the desired nodes.
> +
> +6) Boot-time huge page allocation attempts to distribute the requested number
> +   of huge pages over all on-line nodes.
> +
> +Per Node Hugepages Attributes
> +
> +A subset of the contents of the root huge page control directory in sysfs,
> +described above, has been replicated under each "node" system device in:
> +
> +	/sys/devices/system/node/node[0-9]*/hugepages/
> +
> +Under this directory, the subdirectory for each supported huge page size
> +contains the following attribute files:
> +
> +	nr_hugepages
> +	free_hugepages
> +	surplus_hugepages
> +
> +The 'free_' and 'surplus_' attribute files are read-only.  They return the number
> +of free and surplus [overcommitted] huge pages, respectively, on the parent
> +node.
> +
> +The nr_hugepages attribute will return the total number of huge pages on the
> +specified node.  When this attribute is written, the number of persistent huge
> +pages on the parent node will be adjusted to the specified value, if sufficient
> +resources exist, regardless of the task's mempolicy or cpuset constraints.
> +
> +Note that the number of overcommit and reserve pages remain global quantities,
> +as we don't know until fault time, when the faulting task's mempolicy is applied,
> +from which node the huge page allocation will be attempted.
> +
> +
> +Using Huge Pages:
> +
>  If the user applications are going to request huge pages using mmap system
>  call, then it is required that system administrator mount a file system of
>  type hugetlbfs:
> @@ -206,9 +301,11 @@ map_hugetlb.c.
>   * requesting huge pages.
>   *
>   * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> - * huge pages.  That means the addresses starting with 0x800000... will need
> - * to be specified.  Specifying a fixed address is not required on ppc64,
> - * i386 or x86_64.
> + * huge pages.  That means that if one requires a fixed address, a huge page
> + * aligned address starting with 0x800000... will be required.  If a fixed
> + * address is not required, the kernel will select an address in the proper
> + * range.
> + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
>   *
>   * Note: The default shared memory limit is quite low on many kernels,
>   * you may need to increase it via:
> @@ -237,14 +334,8 @@ map_hugetlb.c.
>  
>  #define dprintf(x)  printf(x)
>  
> -/* Only ia64 requires this */
> -#ifdef __ia64__
> -#define ADDR (void *)(0x8000000000000000UL)
> -#define SHMAT_FLAGS (SHM_RND)
> -#else
> -#define ADDR (void *)(0x0UL)
> +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
>  #define SHMAT_FLAGS (0)
> -#endif
>  
>  int main(void)
>  {
> @@ -302,10 +393,12 @@ int main(void)
>   * example, the app is requesting memory of size 256MB that is backed by
>   * huge pages.
>   *
> - * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
> - * That means the addresses starting with 0x800000... will need to be
> - * specified.  Specifying a fixed address is not required on ppc64, i386
> - * or x86_64.
> + * For the ia64 architecture, the Linux kernel reserves Region number 4 for
> + * huge pages.  That means that if one requires a fixed address, a huge page
> + * aligned address starting with 0x800000... will be required.  If a fixed
> + * address is not required, the kernel will select an address in the proper
> + * range.
> + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
>   */
>  #include <stdlib.h>
>  #include <stdio.h>
> @@ -317,14 +410,8 @@ int main(void)
>  #define LENGTH (256UL*1024*1024)
>  #define PROTECTION (PROT_READ | PROT_WRITE)
>  
> -/* Only ia64 requires this */
> -#ifdef __ia64__
> -#define ADDR (void *)(0x8000000000000000UL)
> -#define FLAGS (MAP_SHARED | MAP_FIXED)
> -#else
> -#define ADDR (void *)(0x0UL)
> +#define ADDR (void *)(0x0UL)	/* let kernel choose address */
>  #define FLAGS (MAP_SHARED)
> -#endif
>  
>  void check_bytes(char *addr)
>  {
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/11] hugetlb:  Optionally use mempolicy for persistent huge page allocation
  2009-09-15 20:45   ` Lee Schermerhorn
@ 2009-09-16 13:48     ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2009-09-16 13:48 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Sep 15, 2009 at 04:45:10PM -0400, Lee Schermerhorn wrote:
> [PATCH 8/11] hugetlb:  Optionally use mempolicy for persistent huge page allocation
> 
> From: Mel Gorman <mel@csn.ul.ie>
> 
> Against:  2.6.31-mmotm-090914-0157
> 
> Patch "derive huge pages nodes allowed from task mempolicy" brought
> huge page support more in line with the core VM in that tuning the size
> of the static huge page pool would obey memory policies. Using this,
> administrators could interleave allocation of huge pages from a subset
> of nodes. This is consistent with how dynamic hugepage pool resizing
> works and how hugepages get allocated to applications at run-time.
> 
> However, it was pointed out that scripts may exist that depend on being
> able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> that are running within a memory policy. This patch adds
> /proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
> memory policies. /proc/sys/vm/nr_hugepages continues then to be a
> system-wide tunable regardless of memory policy.
> 
> Replicate the vm/nr_hugepages_mempolicy sysctl under the sysfs global
> hstate attributes directory.
> 
> Note:  with this patch, hugeadm will require update to write to the
> vm/nr_hugepages_mempolicy sysctl/attribute when one wants to adjust
> the hugepage pool on a specific set of nodes.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> 
>  Documentation/vm/hugetlbpage.txt |   36 ++++++++-------
>  include/linux/hugetlb.h          |    6 ++
>  kernel/sysctl.c                  |   12 +++++
>  mm/hugetlb.c                     |   91 ++++++++++++++++++++++++++++++++-------
>  4 files changed, 114 insertions(+), 31 deletions(-)
> 
> Index: linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/include/linux/hugetlb.h	2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h	2009-09-15 13:48:11.000000000 -0400
> @@ -23,6 +23,12 @@ void reset_vma_resv_huge_pages(struct vm
>  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
> +				void __user *, size_t *, loff_t *);
> +#endif
> +
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			struct page **, struct vm_area_struct **,
> Index: linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/kernel/sysctl.c	2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c	2009-09-15 13:43:36.000000000 -0400
> @@ -1170,6 +1170,18 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= (void *)&hugetlb_zero,
>  		.extra2		= (void *)&hugetlb_infinity,
>  	 },
> +#ifdef CONFIG_NUMA
> +	 {
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "nr_hugepages_mempolicy",
> +		.data		= NULL,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
> +		.extra1		= (void *)&hugetlb_zero,
> +		.extra2		= (void *)&hugetlb_infinity,
> +	 },
> +#endif
>  	 {
>  		.ctl_name	= VM_HUGETLB_GROUP,
>  		.procname	= "hugetlb_shm_group",
> Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:43:13.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:50:28.000000000 -0400
> @@ -1243,6 +1243,7 @@ static int adjust_pool_surplus(struct hs
>  	return ret;
>  }
>  
> +#define HUGETLB_NO_NODE_OBEY_MEMPOLICY (NUMA_NO_NODE - 1)
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
>  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>  								int nid)

As a pre-emptive note to David: I made a quick stab at getting rid of
HUGETLB_NO_NODE_OBEY_MEMPOLICY by allocating the nodemask higher up in
the chain. However, it was getting progressively more horrible looking
and I decided that the current definition was easier to understand. I
would be biased though.

> @@ -1253,9 +1254,14 @@ static unsigned long set_max_huge_pages(
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> -	if (nid == NUMA_NO_NODE) {
> +	switch (nid) {
> +	case HUGETLB_NO_NODE_OBEY_MEMPOLICY:
>  		nodes_allowed = alloc_nodemask_of_mempolicy();
> -	} else {
> +		break;
> +	case NUMA_NO_NODE:
> +		nodes_allowed = &node_online_map;
> +		break;
> +	default:
>  		/*
>  		 * incoming 'count' is for node 'nid' only, so
>  		 * adjust count to global, but restrict alloc/free
> @@ -1354,23 +1360,24 @@ static struct hstate *kobj_to_hstate(str
>  
>  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
>  		if (hstate_kobjs[i] == kobj) {
> -			if (nidp)
> -				*nidp = NUMA_NO_NODE;
> +			/*
> +			 * let *nidp default.
> +			 */
>  			return &hstates[i];
>  		}
>  
>  	return kobj_to_node_hstate(kobj, nidp);
>  }
>  
> -static ssize_t nr_hugepages_show(struct kobject *kobj,
> +static ssize_t nr_hugepages_show_common(int nid_default, struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
>  	struct hstate *h;
>  	unsigned long nr_huge_pages;
> -	int nid;
> +	int nid = nid_default;
>  
>  	h = kobj_to_hstate(kobj, &nid);
> -	if (nid == NUMA_NO_NODE)
> +	if (nid < 0)
>  		nr_huge_pages = h->nr_huge_pages;
>  	else
>  		nr_huge_pages = h->nr_huge_pages_node[nid];
> @@ -1378,12 +1385,12 @@ static ssize_t nr_hugepages_show(struct
>  	return sprintf(buf, "%lu\n", nr_huge_pages);
>  }
>  
> -static ssize_t nr_hugepages_store(struct kobject *kobj,
> +static ssize_t nr_hugepages_store_common(int nid_default, struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t len)
>  {
>  	unsigned long count;
>  	struct hstate *h;
> -	int nid;
> +	int nid = nid_default;
>  	int err;
>  
>  	err = strict_strtoul(buf, 10, &count);
> @@ -1395,8 +1402,42 @@ static ssize_t nr_hugepages_store(struct
>  
>  	return len;
>  }
> +
> +static ssize_t nr_hugepages_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	return nr_hugepages_show_common(NUMA_NO_NODE, kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	return nr_hugepages_store_common(NUMA_NO_NODE, kobj, attr, buf, len);
> +}
>  HSTATE_ATTR(nr_hugepages);
>  
> +#ifdef CONFIG_NUMA
> +
> +/*
> + * hstate attribute for optionally mempolicy-based constraint on persistent
> + * huge page alloc/free.
> + */
> +static ssize_t nr_hugepages_mempolicy_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	return nr_hugepages_show_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> +						kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_mempolicy_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	return nr_hugepages_store_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> +					kobj, attr, buf, len);
> +}
> +HSTATE_ATTR(nr_hugepages_mempolicy);
> +#endif
> +
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> @@ -1429,7 +1470,7 @@ static ssize_t free_hugepages_show(struc
>  {
>  	struct hstate *h;
>  	unsigned long free_huge_pages;
> -	int nid;
> +	int nid = NUMA_NO_NODE;
>  
>  	h = kobj_to_hstate(kobj, &nid);
>  	if (nid == NUMA_NO_NODE)
> @@ -1454,7 +1495,7 @@ static ssize_t surplus_hugepages_show(st
>  {
>  	struct hstate *h;
>  	unsigned long surplus_huge_pages;
> -	int nid;
> +	int nid = NUMA_NO_NODE;
>  
>  	h = kobj_to_hstate(kobj, &nid);
>  	if (nid == NUMA_NO_NODE)
> @@ -1472,6 +1513,9 @@ static struct attribute *hstate_attrs[]
>  	&free_hugepages_attr.attr,
>  	&resv_hugepages_attr.attr,
>  	&surplus_hugepages_attr.attr,
> +#ifdef CONFIG_NUMA
> +	&nr_hugepages_mempolicy_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> @@ -1809,9 +1853,9 @@ static unsigned int cpuset_mems_nr(unsig
>  }
>  
>  #ifdef CONFIG_SYSCTL
> -int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> -			   void __user *buffer,
> -			   size_t *length, loff_t *ppos)
> +static int hugetlb_sysctl_handler_common(int no_node,
> +			 struct ctl_table *table, int write,
> +			 void __user *buffer, size_t *length, loff_t *ppos)
>  {
>  	struct hstate *h = &default_hstate;
>  	unsigned long tmp;
> @@ -1824,7 +1868,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	proc_doulongvec_minmax(table, write, buffer, length, ppos);
>  
>  	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp, NUMA_NO_NODE);
> +		h->max_huge_pages = set_max_huge_pages(h, tmp, no_node);
>  
>  	return 0;
>  }
> @@ -1864,6 +1908,23 @@ int hugetlb_overcommit_handler(struct ct
>  	return 0;
>  }
>  
> +int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> +			   void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +
> +	return hugetlb_sysctl_handler_common(NUMA_NO_NODE,
> +				table, write, buffer, length, ppos);
> +}
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
> +			   void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	return hugetlb_sysctl_handler_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> +				table, write, buffer, length, ppos);
> +}
> +#endif /* CONFIG_NUMA */
> +
>  #endif /* CONFIG_SYSCTL */
>  
>  void hugetlb_report_meminfo(struct seq_file *m)
> Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:32.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:36.000000000 -0400
> @@ -155,6 +155,7 @@ will exist, of the form:
>  Inside each of these directories, the same set of files will exist:
>  
>  	nr_hugepages
> +	nr_hugepages_mempolicy
>  	nr_overcommit_hugepages
>  	free_hugepages
>  	resv_hugepages
> @@ -166,26 +167,30 @@ which function as described above for th
>  Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
>  
>  Whether huge pages are allocated and freed via the /proc interface or
> -the /sysfs interface, the NUMA nodes from which huge pages are allocated
> -or freed are controlled by the NUMA memory policy of the task that modifies
> -the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> +the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
> +nodes from which huge pages are allocated or freed are controlled by the
> +NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
> +sysctl or attribute.  When the nr_hugepages attribute is used, mempolicy
> +is ignored.
>  
>  The recommended method to allocate or free huge pages to/from the kernel
>  huge page pool, using the nr_hugepages example above, is:
>  
> -    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> +    numactl --interleave <node-list> echo 20 \
> +				>/proc/sys/vm/nr_hugepages_mempolicy
>  
>  or, more succinctly:
>  
> -    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
>  
>  This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> -specified in <node-list>, depending on whether nr_hugepages is initially
> -less than or greater than 20, respectively.  No huge pages will be
> +specified in <node-list>, depending on whether the number of persistent huge pages
> +is initially less than or greater than 20, respectively.  No huge pages will be
>  allocated nor freed on any node not included in the specified <node-list>.
>  
> -Any memory policy mode--bind, preferred, local or interleave--may be
> -used.  The effect on persistent huge page allocation is as follows:
> +When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
> +memory policy mode--bind, preferred, local or interleave--may be used.  The
> +resulting effect on persistent huge page allocation is as follows:
>  
>  1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
>     persistent huge pages will be distributed across the node or nodes
> @@ -201,27 +206,26 @@ used.  The effect on persistent huge pag
>     If more than one node is specified with the preferred policy, only the
>     lowest numeric id will be used.  Local policy will select the node where
>     the task is running at the time the nodes_allowed mask is constructed.
> -
> -3) For local policy to be deterministic, the task must be bound to a cpu or
> +   For local policy to be deterministic, the task must be bound to a cpu or
>     cpus in a single node.  Otherwise, the task could be migrated to some
>     other node at any time after launch and the resulting node will be
>     indeterminate.  Thus, local policy is not very useful for this purpose.
>     Any of the other mempolicy modes may be used to specify a single node.
>  
> -4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +3) The nodes allowed mask will be derived from any non-default task mempolicy,
>     whether this policy was set explicitly by the task itself or one of its
>     ancestors, such as numactl.  This means that if the task is invoked from a
>     shell with non-default policy, that policy will be used.  One can specify a
>     node list of "all" with numactl --interleave or --membind [-m] to achieve
>     interleaving over all nodes in the system or cpuset.
>  
> -5) Any task mempolicy specified--e.g., using numactl--will be constrained by
> +4) Any task mempolicy specified--e.g., using numactl--will be constrained by
>     the resource limits of any cpuset in which the task runs.  Thus, there will
>     be no way for a task with non-default policy running in a cpuset with a
>     subset of the system nodes to allocate huge pages outside the cpuset
>     without first moving to a cpuset that contains all of the desired nodes.
>  
> -6) Boot-time huge page allocation attempts to distribute the requested number
> +5) Boot-time huge page allocation attempts to distribute the requested number
>     of huge pages over all on-line nodes.
>  
>  Per Node Hugepages Attributes
> @@ -248,8 +252,8 @@ pages on the parent node will be adjuste
>  resources exist, regardless of the task's mempolicy or cpuset constraints.
>  
>  Note that the number of overcommit and reserve pages remain global quantities,
> -as we don't know until fault time, when the faulting task's mempolicy is applied,
> -from which node the huge page allocation will be attempted.
> +as we don't know until fault time, when the faulting task's mempolicy is
> +applied, from which node the huge page allocation will be attempted.
>  
>  
>  Using Huge Pages:
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/11] hugetlb:  Optionally use mempolicy for persistent huge page allocation
@ 2009-09-16 13:48     ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2009-09-16 13:48 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Sep 15, 2009 at 04:45:10PM -0400, Lee Schermerhorn wrote:
> [PATCH 8/11] hugetlb:  Optionally use mempolicy for persistent huge page allocation
> 
> From: Mel Gorman <mel@csn.ul.ie>
> 
> Against:  2.6.31-mmotm-090914-0157
> 
> Patch "derive huge pages nodes allowed from task mempolicy" brought
> huge page support more in line with the core VM in that tuning the size
> of the static huge page pool would obey memory policies. Using this,
> administrators could interleave allocation of huge pages from a subset
> of nodes. This is consistent with how dynamic hugepage pool resizing
> works and how hugepages get allocated to applications at run-time.
> 
> However, it was pointed out that scripts may exist that depend on being
> able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> that are running within a memory policy. This patch adds
> /proc/sys/vm/nr_hugepages_mempolicy which when written to will obey
> memory policies. /proc/sys/vm/nr_hugepages continues then to be a
> system-wide tunable regardless of memory policy.
> 
> Replicate the vm/nr_hugepages_mempolicy sysctl under the sysfs global
> hstate attributes directory.
> 
> Note:  with this patch, hugeadm will require update to write to the
> vm/nr_hugepages_mempolicy sysctl/attribute when one wants to adjust
> the hugepage pool on a specific set of nodes.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> 
>  Documentation/vm/hugetlbpage.txt |   36 ++++++++-------
>  include/linux/hugetlb.h          |    6 ++
>  kernel/sysctl.c                  |   12 +++++
>  mm/hugetlb.c                     |   91 ++++++++++++++++++++++++++++++++-------
>  4 files changed, 114 insertions(+), 31 deletions(-)
> 
> Index: linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/include/linux/hugetlb.h	2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h	2009-09-15 13:48:11.000000000 -0400
> @@ -23,6 +23,12 @@ void reset_vma_resv_huge_pages(struct vm
>  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
> +				void __user *, size_t *, loff_t *);
> +#endif
> +
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			struct page **, struct vm_area_struct **,
> Index: linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/kernel/sysctl.c	2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c	2009-09-15 13:43:36.000000000 -0400
> @@ -1170,6 +1170,18 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= (void *)&hugetlb_zero,
>  		.extra2		= (void *)&hugetlb_infinity,
>  	 },
> +#ifdef CONFIG_NUMA
> +	 {
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "nr_hugepages_mempolicy",
> +		.data		= NULL,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= &hugetlb_mempolicy_sysctl_handler,
> +		.extra1		= (void *)&hugetlb_zero,
> +		.extra2		= (void *)&hugetlb_infinity,
> +	 },
> +#endif
>  	 {
>  		.ctl_name	= VM_HUGETLB_GROUP,
>  		.procname	= "hugetlb_shm_group",
> Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:43:13.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:50:28.000000000 -0400
> @@ -1243,6 +1243,7 @@ static int adjust_pool_surplus(struct hs
>  	return ret;
>  }
>  
> +#define HUGETLB_NO_NODE_OBEY_MEMPOLICY (NUMA_NO_NODE - 1)
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
>  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>  								int nid)

As a pre-emptive note to David: I made a quick stab at getting rid of
HUGETLB_NO_NODE_OBEY_MEMPOLICY by allocating the nodemask higher up in
the chain. However, it was getting progressively more horrible looking
and I decided that the current definition was easier to understand. I
would be biased though.

> @@ -1253,9 +1254,14 @@ static unsigned long set_max_huge_pages(
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> -	if (nid == NUMA_NO_NODE) {
> +	switch (nid) {
> +	case HUGETLB_NO_NODE_OBEY_MEMPOLICY:
>  		nodes_allowed = alloc_nodemask_of_mempolicy();
> -	} else {
> +		break;
> +	case NUMA_NO_NODE:
> +		nodes_allowed = &node_online_map;
> +		break;
> +	default:
>  		/*
>  		 * incoming 'count' is for node 'nid' only, so
>  		 * adjust count to global, but restrict alloc/free
> @@ -1354,23 +1360,24 @@ static struct hstate *kobj_to_hstate(str
>  
>  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
>  		if (hstate_kobjs[i] == kobj) {
> -			if (nidp)
> -				*nidp = NUMA_NO_NODE;
> +			/*
> +			 * let *nidp default.
> +			 */
>  			return &hstates[i];
>  		}
>  
>  	return kobj_to_node_hstate(kobj, nidp);
>  }
>  
> -static ssize_t nr_hugepages_show(struct kobject *kobj,
> +static ssize_t nr_hugepages_show_common(int nid_default, struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
>  	struct hstate *h;
>  	unsigned long nr_huge_pages;
> -	int nid;
> +	int nid = nid_default;
>  
>  	h = kobj_to_hstate(kobj, &nid);
> -	if (nid == NUMA_NO_NODE)
> +	if (nid < 0)
>  		nr_huge_pages = h->nr_huge_pages;
>  	else
>  		nr_huge_pages = h->nr_huge_pages_node[nid];
> @@ -1378,12 +1385,12 @@ static ssize_t nr_hugepages_show(struct
>  	return sprintf(buf, "%lu\n", nr_huge_pages);
>  }
>  
> -static ssize_t nr_hugepages_store(struct kobject *kobj,
> +static ssize_t nr_hugepages_store_common(int nid_default, struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t len)
>  {
>  	unsigned long count;
>  	struct hstate *h;
> -	int nid;
> +	int nid = nid_default;
>  	int err;
>  
>  	err = strict_strtoul(buf, 10, &count);
> @@ -1395,8 +1402,42 @@ static ssize_t nr_hugepages_store(struct
>  
>  	return len;
>  }
> +
> +static ssize_t nr_hugepages_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	return nr_hugepages_show_common(NUMA_NO_NODE, kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	return nr_hugepages_store_common(NUMA_NO_NODE, kobj, attr, buf, len);
> +}
>  HSTATE_ATTR(nr_hugepages);
>  
> +#ifdef CONFIG_NUMA
> +
> +/*
> + * hstate attribute for optionally mempolicy-based constraint on persistent
> + * huge page alloc/free.
> + */
> +static ssize_t nr_hugepages_mempolicy_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	return nr_hugepages_show_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> +						kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_mempolicy_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	return nr_hugepages_store_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> +					kobj, attr, buf, len);
> +}
> +HSTATE_ATTR(nr_hugepages_mempolicy);
> +#endif
> +
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> @@ -1429,7 +1470,7 @@ static ssize_t free_hugepages_show(struc
>  {
>  	struct hstate *h;
>  	unsigned long free_huge_pages;
> -	int nid;
> +	int nid = NUMA_NO_NODE;
>  
>  	h = kobj_to_hstate(kobj, &nid);
>  	if (nid == NUMA_NO_NODE)
> @@ -1454,7 +1495,7 @@ static ssize_t surplus_hugepages_show(st
>  {
>  	struct hstate *h;
>  	unsigned long surplus_huge_pages;
> -	int nid;
> +	int nid = NUMA_NO_NODE;
>  
>  	h = kobj_to_hstate(kobj, &nid);
>  	if (nid == NUMA_NO_NODE)
> @@ -1472,6 +1513,9 @@ static struct attribute *hstate_attrs[]
>  	&free_hugepages_attr.attr,
>  	&resv_hugepages_attr.attr,
>  	&surplus_hugepages_attr.attr,
> +#ifdef CONFIG_NUMA
> +	&nr_hugepages_mempolicy_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> @@ -1809,9 +1853,9 @@ static unsigned int cpuset_mems_nr(unsig
>  }
>  
>  #ifdef CONFIG_SYSCTL
> -int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> -			   void __user *buffer,
> -			   size_t *length, loff_t *ppos)
> +static int hugetlb_sysctl_handler_common(int no_node,
> +			 struct ctl_table *table, int write,
> +			 void __user *buffer, size_t *length, loff_t *ppos)
>  {
>  	struct hstate *h = &default_hstate;
>  	unsigned long tmp;
> @@ -1824,7 +1868,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	proc_doulongvec_minmax(table, write, buffer, length, ppos);
>  
>  	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp, NUMA_NO_NODE);
> +		h->max_huge_pages = set_max_huge_pages(h, tmp, no_node);
>  
>  	return 0;
>  }
> @@ -1864,6 +1908,23 @@ int hugetlb_overcommit_handler(struct ct
>  	return 0;
>  }
>  
> +int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> +			   void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +
> +	return hugetlb_sysctl_handler_common(NUMA_NO_NODE,
> +				table, write, buffer, length, ppos);
> +}
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
> +			   void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	return hugetlb_sysctl_handler_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> +				table, write, buffer, length, ppos);
> +}
> +#endif /* CONFIG_NUMA */
> +
>  #endif /* CONFIG_SYSCTL */
>  
>  void hugetlb_report_meminfo(struct seq_file *m)
> Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:32.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt	2009-09-15 13:43:36.000000000 -0400
> @@ -155,6 +155,7 @@ will exist, of the form:
>  Inside each of these directories, the same set of files will exist:
>  
>  	nr_hugepages
> +	nr_hugepages_mempolicy
>  	nr_overcommit_hugepages
>  	free_hugepages
>  	resv_hugepages
> @@ -166,26 +167,30 @@ which function as described above for th
>  Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
>  
>  Whether huge pages are allocated and freed via the /proc interface or
> -the /sysfs interface, the NUMA nodes from which huge pages are allocated
> -or freed are controlled by the NUMA memory policy of the task that modifies
> -the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
> +the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
> +nodes from which huge pages are allocated or freed are controlled by the
> +NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
> +sysctl or attribute.  When the nr_hugepages attribute is used, mempolicy
> +is ignored.
>  
>  The recommended method to allocate or free huge pages to/from the kernel
>  huge page pool, using the nr_hugepages example above, is:
>  
> -    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> +    numactl --interleave <node-list> echo 20 \
> +				>/proc/sys/vm/nr_hugepages_mempolicy
>  
>  or, more succinctly:
>  
> -    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> +    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
>  
>  This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> -specified in <node-list>, depending on whether nr_hugepages is initially
> -less than or greater than 20, respectively.  No huge pages will be
> +specified in <node-list>, depending on whether the number of persistent huge
> +pages is initially less than or greater than 20, respectively.  No huge pages will be
>  allocated nor freed on any node not included in the specified <node-list>.
>  
> -Any memory policy mode--bind, preferred, local or interleave--may be
> -used.  The effect on persistent huge page allocation is as follows:
> +When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
> +memory policy mode--bind, preferred, local or interleave--may be used.  The
> +resulting effect on persistent huge page allocation is as follows:
>  
>  1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
>     persistent huge pages will be distributed across the node or nodes
> @@ -201,27 +206,26 @@ used.  The effect on persistent huge pag
>     If more than one node is specified with the preferred policy, only the
>     lowest numeric id will be used.  Local policy will select the node where
>     the task is running at the time the nodes_allowed mask is constructed.
> -
> -3) For local policy to be deterministic, the task must be bound to a cpu or
> +   For local policy to be deterministic, the task must be bound to a cpu or
>     cpus in a single node.  Otherwise, the task could be migrated to some
>     other node at any time after launch and the resulting node will be
>     indeterminate.  Thus, local policy is not very useful for this purpose.
>     Any of the other mempolicy modes may be used to specify a single node.
>  
> -4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +3) The nodes allowed mask will be derived from any non-default task mempolicy,
>     whether this policy was set explicitly by the task itself or one of its
>     ancestors, such as numactl.  This means that if the task is invoked from a
>     shell with non-default policy, that policy will be used.  One can specify a
>     node list of "all" with numactl --interleave or --membind [-m] to achieve
>     interleaving over all nodes in the system or cpuset.
>  
> -5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> +4) Any task mempolicy specified--e.g., using numactl--will be constrained by
>     the resource limits of any cpuset in which the task runs.  Thus, there will
>     be no way for a task with non-default policy running in a cpuset with a
>     subset of the system nodes to allocate huge pages outside the cpuset
>     without first moving to a cpuset that contains all of the desired nodes.
>  
> -6) Boot-time huge page allocation attempts to distribute the requested number
> +5) Boot-time huge page allocation attempts to distribute the requested number
>     of huge pages over all on-lines nodes.
>  
>  Per Node Hugepages Attributes
> @@ -248,8 +252,8 @@ pages on the parent node will be adjuste
>  resources exist, regardless of the task's mempolicy or cpuset constraints.
>  
>  Note that the number of overcommit and reserve pages remain global quantities,
> -as we don't know until fault time, when the faulting task's mempolicy is applied,
> -from which node the huge page allocation will be attempted.
> +as we don't know until fault time, when the faulting task's mempolicy is
> +applied, from which node the huge page allocation will be attempted.
>  
>  
>  Using Huge Pages:
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/11] hugetlb:  add generic definition of NUMA_NO_NODE
  2009-09-15 20:44   ` Lee Schermerhorn
@ 2009-09-17 13:28     ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2009-09-17 13:28 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Randy Dunlap, Nishanth Aravamudan,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Sep 15, 2009 at 04:44:52PM -0400, Lee Schermerhorn wrote:
> [PATCH 5/11] - hugetlb:  promote NUMA_NO_NODE to generic constant
> 
> Against:  2.6.31-mmotm-090914-0157
> 
> New in V7 of series
> 
> Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific
> headers to generic header 'linux/numa.h' for use in generic code.
> NUMA_NO_NODE replaces bare '-1' where it's used in this series to
> indicate "no node id specified".  Ultimately, it can be used
> to replace the bare -1 in other places where it is used similarly.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Acked-by: David Rientjes <rientjes@google.com>
> 

Acked-by: Mel Gorman <mel@csn.ul.ie>

>  arch/ia64/include/asm/numa.h    |    2 --
>  arch/x86/include/asm/topology.h |    5 ++---
>  include/linux/numa.h            |    2 ++
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.31-mmotm-090914-0157/arch/ia64/include/asm/numa.h
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/arch/ia64/include/asm/numa.h	2009-09-15 13:19:02.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/arch/ia64/include/asm/numa.h	2009-09-15 13:42:19.000000000 -0400
> @@ -22,8 +22,6 @@
>  
>  #include <asm/mmzone.h>
>  
> -#define NUMA_NO_NODE	-1
> -
>  extern u16 cpu_to_node_map[NR_CPUS] __cacheline_aligned;
>  extern cpumask_t node_to_cpu_mask[MAX_NUMNODES] __cacheline_aligned;
>  extern pg_data_t *pgdat_list[MAX_NUMNODES];
> Index: linux-2.6.31-mmotm-090914-0157/arch/x86/include/asm/topology.h
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/arch/x86/include/asm/topology.h	2009-09-15 13:19:02.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/arch/x86/include/asm/topology.h	2009-09-15 13:42:19.000000000 -0400
> @@ -35,11 +35,10 @@
>  # endif
>  #endif
>  
> -/* Node not present */
> -#define NUMA_NO_NODE	(-1)
> -
>  #ifdef CONFIG_NUMA
>  #include <linux/cpumask.h>
> +#include <linux/numa.h>
> +
>  #include <asm/mpspec.h>
>  
>  #ifdef CONFIG_X86_32
> Index: linux-2.6.31-mmotm-090914-0157/include/linux/numa.h
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/include/linux/numa.h	2009-09-15 13:19:02.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/include/linux/numa.h	2009-09-15 13:42:19.000000000 -0400
> @@ -10,4 +10,6 @@
>  
>  #define MAX_NUMNODES    (1 << NODES_SHIFT)
>  
> +#define	NUMA_NO_NODE	(-1)
> +
>  #endif /* _LINUX_NUMA_H */
> 
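
As a purely illustrative sketch (not part of the patch; the helper below is
hypothetical), generic code can now spell the sentinel instead of a bare -1:

#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/topology.h>

/* Hypothetical example only: treat NUMA_NO_NODE as "no node preference". */
static struct page *example_alloc_on_node(int nid)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_node_id();	/* fall back to the local node */
	return alloc_pages_node(nid, GFP_KERNEL, 0);
}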

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/11] hugetlb:  rework hstate_next_node_* functions
  2009-09-15 20:43   ` Lee Schermerhorn
@ 2009-09-22 18:08     ` David Rientjes
  -1 siblings, 0 replies; 32+ messages in thread
From: David Rientjes @ 2009-09-22 18:08 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Randy Dunlap,
	Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 15 Sep 2009, Lee Schermerhorn wrote:

> Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:42:14.000000000 -0400
> @@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag
>  }
>  
>  /*
> + * common helper function for hstate_next_node_to_{alloc|free}.
> + * return next node in node_online_map, wrapping at end.
> + */
> +static int next_node_allowed(int nid)
> +{
> +	nid = next_node(nid, node_online_map);
> +	if (nid == MAX_NUMNODES)
> +		nid = first_node(node_online_map);
> +	VM_BUG_ON(nid >= MAX_NUMNODES);
> +
> +	return nid;
> +}
> +
> +/*
>   * Use a helper variable to find the next node and then
>   * copy it back to next_nid_to_alloc afterwards:
>   * otherwise there's a window in which a racer might
> @@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag
>   */
>  static int hstate_next_node_to_alloc(struct hstate *h)
>  {
> -	int next_nid;
> -	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
> -	if (next_nid == MAX_NUMNODES)
> -		next_nid = first_node(node_online_map);
> +	int nid, next_nid;
> +
> +	nid = h->next_nid_to_alloc;
> +	next_nid = next_node_allowed(nid);
>  	h->next_nid_to_alloc = next_nid;
> -	return next_nid;
> +	return nid;
>  }
>  
>  static int alloc_fresh_huge_page(struct hstate *h)

I thought you had refactored this to drop next_nid entirely since gcc 
optimizes it away?

> @@ -649,15 +663,17 @@ static int alloc_fresh_huge_page(struct
>  	int next_nid;
>  	int ret = 0;
>  
> -	start_nid = h->next_nid_to_alloc;
> +	start_nid = hstate_next_node_to_alloc(h);
>  	next_nid = start_nid;
>  
>  	do {
>  		page = alloc_fresh_huge_page_node(h, next_nid);
> -		if (page)
> +		if (page) {
>  			ret = 1;
> +			break;
> +		}
>  		next_nid = hstate_next_node_to_alloc(h);
> -	} while (!page && next_nid != start_nid);
> +	} while (next_nid != start_nid);
>  
>  	if (ret)
>  		count_vm_event(HTLB_BUDDY_PGALLOC);
> @@ -668,17 +684,19 @@ static int alloc_fresh_huge_page(struct
>  }
>  
>  /*
> - * helper for free_pool_huge_page() - find next node
> - * from which to free a huge page
> + * helper for free_pool_huge_page() - return the next node
> + * from which to free a huge page.  Advance the next node id
> + * whether or not we find a free huge page to free so that the
> + * next attempt to free addresses the next node.
>   */
>  static int hstate_next_node_to_free(struct hstate *h)
>  {
> -	int next_nid;
> -	next_nid = next_node(h->next_nid_to_free, node_online_map);
> -	if (next_nid == MAX_NUMNODES)
> -		next_nid = first_node(node_online_map);
> +	int nid, next_nid;
> +
> +	nid = h->next_nid_to_free;
> +	next_nid = next_node_allowed(nid);
>  	h->next_nid_to_free = next_nid;
> -	return next_nid;
> +	return nid;
>  }
>  
>  /*

Ditto for next_nid.

> @@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
>  	int next_nid;
>  	int ret = 0;
>  
> -	start_nid = h->next_nid_to_free;
> +	start_nid = hstate_next_node_to_free(h);
>  	next_nid = start_nid;
>  
>  	do {
> @@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
>  			}
>  			update_and_free_page(h, page);
>  			ret = 1;
> +			break;
>  		}
>  		next_nid = hstate_next_node_to_free(h);
> -	} while (!ret && next_nid != start_nid);
> +	} while (next_nid != start_nid);
>  
>  	return ret;
>  }
> @@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
>  		void *addr;
>  
>  		addr = __alloc_bootmem_node_nopanic(
> -				NODE_DATA(h->next_nid_to_alloc),
> +				NODE_DATA(hstate_next_node_to_alloc(h)),
>  				huge_page_size(h), huge_page_size(h), 0);
>  
> -		hstate_next_node_to_alloc(h);
>  		if (addr) {
>  			/*
>  			 * Use the beginning of the huge page to store the

Shouldn't that panic if hstate_next_node_to_alloc() returns a memoryless 
node since it uses node_online_map?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/11] hugetlb:  rework hstate_next_node_* functions
  2009-09-22 18:08     ` David Rientjes
@ 2009-09-22 20:08       ` Lee Schermerhorn
  -1 siblings, 0 replies; 32+ messages in thread
From: Lee Schermerhorn @ 2009-09-22 20:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Randy Dunlap,
	Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-09-22 at 11:08 -0700, David Rientjes wrote: 
> On Tue, 15 Sep 2009, Lee Schermerhorn wrote:
> 
> > Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c	2009-09-15 13:23:01.000000000 -0400
> > +++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c	2009-09-15 13:42:14.000000000 -0400
> > @@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag
> >  }
> >  
> >  /*
> > + * common helper function for hstate_next_node_to_{alloc|free}.
> > + * return next node in node_online_map, wrapping at end.
> > + */
> > +static int next_node_allowed(int nid)
> > +{
> > +	nid = next_node(nid, node_online_map);
> > +	if (nid == MAX_NUMNODES)
> > +		nid = first_node(node_online_map);
> > +	VM_BUG_ON(nid >= MAX_NUMNODES);
> > +
> > +	return nid;
> > +}
> > +
> > +/*
> >   * Use a helper variable to find the next node and then
> >   * copy it back to next_nid_to_alloc afterwards:
> >   * otherwise there's a window in which a racer might
> > @@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag
> >   */
> >  static int hstate_next_node_to_alloc(struct hstate *h)
> >  {
> > -	int next_nid;
> > -	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
> > -	if (next_nid == MAX_NUMNODES)
> > -		next_nid = first_node(node_online_map);
> > +	int nid, next_nid;
> > +
> > +	nid = h->next_nid_to_alloc;
> > +	next_nid = next_node_allowed(nid);
> >  	h->next_nid_to_alloc = next_nid;
> > -	return next_nid;
> > +	return nid;
> >  }
> >  
> >  static int alloc_fresh_huge_page(struct hstate *h)
> 
> I thought you had refactored this to drop next_nid entirely since gcc 
> optimizes it away?

Looks like I handled that in the subsequent patch.  Probably you had
commented about removing next_nid on that patch.

> >  /*
> > - * helper for free_pool_huge_page() - find next node
> > - * from which to free a huge page
> > + * helper for free_pool_huge_page() - return the next node
> > + * from which to free a huge page.  Advance the next node id
> > + * whether or not we find a free huge page to free so that the
> > + * next attempt to free addresses the next node.
> >   */
> >  static int hstate_next_node_to_free(struct hstate *h)
> >  {
> > -	int next_nid;
> > -	next_nid = next_node(h->next_nid_to_free, node_online_map);
> > -	if (next_nid == MAX_NUMNODES)
> > -		next_nid = first_node(node_online_map);
> > +	int nid, next_nid;
> > +
> > +	nid = h->next_nid_to_free;
> > +	next_nid = next_node_allowed(nid);
> >  	h->next_nid_to_free = next_nid;
> > -	return next_nid;
> > +	return nid;
> >  }
> >  
> >  /*
> 
> Ditto for next_nid.

Same.  

> 
> > @@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
> >  	int next_nid;
> >  	int ret = 0;
> >  
> > -	start_nid = h->next_nid_to_free;
> > +	start_nid = hstate_next_node_to_free(h);
> >  	next_nid = start_nid;
> >  
> >  	do {
> > @@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
> >  			}
> >  			update_and_free_page(h, page);
> >  			ret = 1;
> > +			break;
> >  		}
> >  		next_nid = hstate_next_node_to_free(h);
> > -	} while (!ret && next_nid != start_nid);
> > +	} while (next_nid != start_nid);
> >  
> >  	return ret;
> >  }
> > @@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
> >  		void *addr;
> >  
> >  		addr = __alloc_bootmem_node_nopanic(
> > -				NODE_DATA(h->next_nid_to_alloc),
> > +				NODE_DATA(hstate_next_node_to_alloc(h)),
> >  				huge_page_size(h), huge_page_size(h), 0);
> >  
> > -		hstate_next_node_to_alloc(h);
> >  		if (addr) {
> >  			/*
> >  			 * Use the beginning of the huge page to store the
> 
> Shouldn't that panic if hstate_next_node_to_alloc() returns a memoryless 
> node since it uses node_online_map?

Well, the code has always been like this.  And, these allocs shouldn't
panic given a memoryless node.  The run time ones don't anyway.  If
'_THISNODE' is specified, it'll just fail with a NULL addr, else it'll
walk the generic zonelist to find the first node that can provide the
requested page size.  Of course, we don't want that fallback when
populating the pools with persistent huge pages, so we always use the
THISNODE flag.
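
As a rough sketch of that run-time path (illustrative only, modeled loosely
on the existing alloc_fresh_huge_page_node(); not something this series adds):

/*
 * With __GFP_THISNODE the allocation is confined to 'nid' and simply
 * returns NULL when that node has no memory, so the caller just advances
 * to the next allowed node rather than panicking.
 */
static struct page *try_huge_page_on_node(struct hstate *h, int nid)
{
	return alloc_pages_exact_node(nid,
			htlb_alloc_mask | __GFP_COMP | __GFP_THISNODE |
			__GFP_REPEAT | __GFP_NOWARN,
			huge_page_order(h));
}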

Having said that, I've only recently started to [try to] create the
gigabyte pages on my x86_64 [Shanghai] test system, but haven't been
able to allocate any GB pages.  2.6.31 seems to hang early in boot with
the command line options:  "hugepagesz=1GB, hugepages=16".  I've got
256GB of memory on this system, so 16GB shouldn't be a problem to find
at boot time.  Just started looking at this.

Meanwhile, with the full series, that node_online_map is replaced with
node_states[N_HIGH_MEMORY] by a later patch.  I wanted to keep the
replacement of node_online_map with node_states[N_HIGH_MEMORY] in a
separate patch.
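
Just to illustrate that conversion (hypothetical helper, only a sketch):

#include <linux/nodemask.h>

/* Count the nodes that actually have memory, i.e. walk the
 * node_states[N_HIGH_MEMORY] mask instead of node_online_map. */
static unsigned int nodes_with_memory(void)
{
	unsigned int count = 0;
	int nid;

	for_each_node_state(nid, N_HIGH_MEMORY)
		count++;
	return count;
}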




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/11] hugetlb:  rework hstate_next_node_* functions
  2009-09-22 20:08       ` Lee Schermerhorn
@ 2009-09-22 20:13         ` David Rientjes
  -1 siblings, 0 replies; 32+ messages in thread
From: David Rientjes @ 2009-09-22 20:13 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Randy Dunlap,
	Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 22 Sep 2009, Lee Schermerhorn wrote:

> > >  static int hstate_next_node_to_alloc(struct hstate *h)
> > >  {
> > > -	int next_nid;
> > > -	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
> > > -	if (next_nid == MAX_NUMNODES)
> > > -		next_nid = first_node(node_online_map);
> > > +	int nid, next_nid;
> > > +
> > > +	nid = h->next_nid_to_alloc;
> > > +	next_nid = next_node_allowed(nid);
> > >  	h->next_nid_to_alloc = next_nid;
> > > -	return next_nid;
> > > +	return nid;
> > >  }
> > >  
> > >  static int alloc_fresh_huge_page(struct hstate *h)
> > 
> > I thought you had refactored this to drop next_nid entirely since gcc 
> > optimizes it away?
> 
> Looks like I handled that in the subsequent patch.  Probably you had
> commented about removing next_nid on that patch.
> 

Ah, I see it in 2/11, thanks.

> > > @@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
> > >  	int next_nid;
> > >  	int ret = 0;
> > >  
> > > -	start_nid = h->next_nid_to_free;
> > > +	start_nid = hstate_next_node_to_free(h);
> > >  	next_nid = start_nid;
> > >  
> > >  	do {
> > > @@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
> > >  			}
> > >  			update_and_free_page(h, page);
> > >  			ret = 1;
> > > +			break;
> > >  		}
> > >  		next_nid = hstate_next_node_to_free(h);
> > > -	} while (!ret && next_nid != start_nid);
> > > +	} while (next_nid != start_nid);
> > >  
> > >  	return ret;
> > >  }
> > > @@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
> > >  		void *addr;
> > >  
> > >  		addr = __alloc_bootmem_node_nopanic(
> > > -				NODE_DATA(h->next_nid_to_alloc),
> > > +				NODE_DATA(hstate_next_node_to_alloc(h)),
> > >  				huge_page_size(h), huge_page_size(h), 0);
> > >  
> > > -		hstate_next_node_to_alloc(h);
> > >  		if (addr) {
> > >  			/*
> > >  			 * Use the beginning of the huge page to store the
> > 
> > Shouldn't that panic if hstate_next_node_to_alloc() returns a memoryless 
> > node since it uses node_online_map?
> 
> Well, the code has always been like this.  And, these allocs shouldn't
> panic given a memoryless node.  The run time ones don't anyway.  If
> '_THISNODE' is specified, it'll just fail with a NULL addr, else it's
> walk the generic zonelist to find the first node that can provide the
> requested page size.  Of course, we don't want that fallback when
> populating the pools with persistent huge pages, so we always use the
> THISNODE flag.
> 

Whether NODE_DATA() exists for a memoryless node is arch-dependent, I 
think, so the panic I was referring to was a NULL pointer dereference in
bootmem.  I
think you're safe with the conversion to N_HIGH_MEMORY in patch 9/11 upon 
further inspection, though.
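
Something along these lines (hypothetical, just to illustrate the concern)
would avoid handing bootmem a pgdat for a memoryless node:

static void *bootmem_huge_on_node(struct hstate *h, int nid)
{
	/* NODE_DATA(nid) for a memoryless node is arch-dependent, so only
	 * ask bootmem for nodes known to have memory. */
	if (!node_state(nid, N_HIGH_MEMORY))
		return NULL;
	return __alloc_bootmem_node_nopanic(NODE_DATA(nid),
			huge_page_size(h), huge_page_size(h), 0);
}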

> Having said that, I've only recently started to [try to] create the
> gigabyte pages on my x86_64 [Shanghai] test system, but haven't been
> able to allocate any GB pages.  2.6.31 seems to hang early in boot with
> the command line options:  "hugepagesz=1GB, hugepages=16".  I've got
> 256GB of memory on this system, so 16GB shouldn't be a problem to find
> at boot time.  Just started looking at this.
> 

I can try to reproduce that on one of my systems too; I've never tried it 
before.

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2009-09-22 20:13 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-15 20:43 [PATCH 0/11] hugetlb: V7 constrain allocation/free based on task mempolicy Lee Schermerhorn
2009-09-15 20:43 ` [PATCH 1/11] hugetlb: rework hstate_next_node_* functions Lee Schermerhorn
2009-09-15 20:43   ` Lee Schermerhorn
2009-09-22 18:08   ` David Rientjes
2009-09-22 18:08     ` David Rientjes
2009-09-22 20:08     ` Lee Schermerhorn
2009-09-22 20:08       ` Lee Schermerhorn
2009-09-22 20:13       ` David Rientjes
2009-09-22 20:13         ` David Rientjes
2009-09-15 20:44 ` [PATCH 2/11] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Lee Schermerhorn
2009-09-15 20:44   ` Lee Schermerhorn
2009-09-15 20:44 ` [PATCH 3/11] hugetlb: introduce alloc_nodemask_of_node Lee Schermerhorn
2009-09-15 20:44   ` Lee Schermerhorn
2009-09-15 20:44 ` [PATCH 4/11] hugetlb: derive huge pages nodes allowed from task mempolicy Lee Schermerhorn
2009-09-15 20:44   ` Lee Schermerhorn
2009-09-15 20:44 ` [PATCH 5/11] hugetlb: add generic definition of NUMA_NO_NODE Lee Schermerhorn
2009-09-15 20:44   ` Lee Schermerhorn
2009-09-17 13:28   ` Mel Gorman
2009-09-17 13:28     ` Mel Gorman
2009-09-15 20:44 ` [PATCH 6/11] hugetlb: add per node hstate attributes Lee Schermerhorn
2009-09-15 20:45 ` [PATCH 7/11] hugetlb: update hugetlb documentation for mempolicy based management Lee Schermerhorn
2009-09-16 13:37   ` Mel Gorman
2009-09-15 20:45 ` [PATCH 8/11] hugetlb: Optionally use mempolicy for persistent huge page allocation Lee Schermerhorn
2009-09-15 20:45   ` Lee Schermerhorn
2009-09-16 13:48   ` Mel Gorman
2009-09-16 13:48     ` Mel Gorman
2009-09-15 20:45 ` [PATCH 9/11] hugetlb: use only nodes with memory for huge pages Lee Schermerhorn
2009-09-15 20:45   ` Lee Schermerhorn
2009-09-15 20:45 ` [PATCH 10/11] hugetlb: handle memory hot-plug events Lee Schermerhorn
2009-09-15 20:45   ` Lee Schermerhorn
2009-09-15 20:45 ` [PATCH 11/11] hugetlb: offload per node attribute registrations Lee Schermerhorn
2009-09-15 20:45   ` Lee Schermerhorn
