* [PATCH 0/5] hugetlb: numa control of persistent huge pages alloc/free
From: Lee Schermerhorn @ 2009-08-24 19:24 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

PATCH 0/5 hugetlb: numa control of persistent huge pages alloc/free

Against:  2.6.31-rc6-mmotm-090820-1918

This is V4 of a series of patches to provide control over the location
of the allocation and freeing of persistent huge pages on a NUMA
platform.  This series uses the NUMA mempolicy of the task modifying
"nr_hugepages" to constrain the affected nodes.  This method is based
on Mel Gorman's suggestion to use task mempolicy.  One of the benefits
of this method is that it does not *require* modification to hugeadm(8)
to use this feature.  One of the possible downsides is that task
mempolicy is limited by cpuset constraints.
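
For example, borrowing the numactl/hugeadm invocation style from the
patch 3 description below, an administrator could direct 8 additional
default-sized huge pages to node 2 with:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8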

V4 adds a subset of the hugepages sysfs attributes to each per
node system device directory under:

	/sys/devices/system/node/node[0-9]*/hugepages

The per node attributes allow direct assignment of a huge page
count on a specific node, regardless of the task's mempolicy or
cpuset constraints.
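
For example, to set node 2's count of 2048kB huge pages to 16 directly,
regardless of mempolicy [see the patch 4 description below]:

	echo 16 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages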

Note, I haven't implemented a boot time parameter to constrain the
boot time allocation of huge pages.  This can be added if anyone feels
strongly that it is required.


* [PATCH 1/5] hugetlb:  rework hstate_next_node_* functions
From: Lee Schermerhorn @ 2009-08-24 19:25 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 1/5] hugetlb: rework hstate_next_node_* functions

Against: 2.6.31-rc6-mmotm-090820-1918

V2:
+ cleaned up comments: removed some deemed unnecessary,
  added some suggested by review
+ removed check for !current in huge_mpol_nodes_allowed().
+ added 'current->comm' to warning message in huge_mpol_nodes_allowed().
+ added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
  catch out of range node id.
+ added examples to patch description

V3:
+ factored this "cleanup" patch out of V2 patch 2/3
+ moved it ahead of the patch that adds the nodes_allowed mask to the
  alloc functions, as this patch is somewhat independent of using
  task mempolicy to control huge page allocation and freeing.

Modify the hstate_next_node* functions to allow them to be called to
obtain the "start_nid".  Whereas prior to this patch we unconditionally
called hstate_next_node_to_{alloc|free}() whether or not we successfully
allocated/freed a huge page on the node, now we only call these
functions on failure to alloc/free, to advance to the next allowed
node.

Factor out the next_node_allowed() function to handle wrapping at the
end of node_online_map.  In this version, the allowed nodes include all
of the online nodes.
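
In outline, the resulting caller pattern--condensed here from the
alloc_fresh_huge_page() hunk below--is:

	start_nid = hstate_next_node_to_alloc(h);	/* node to try first */
	next_nid = start_nid;
	do {
		page = alloc_fresh_huge_page_node(h, next_nid);
		if (page)
			break;	/* success; next_nid_to_alloc already advanced */
		/* failure: advance to the next allowed node and retry */
		next_nid = hstate_next_node_to_alloc(h);
	} while (next_nid != start_nid);	/* wrapped around: give up */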

Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   70 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 25 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:46.000000000 -0400
@@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
+ * common helper function for hstate_next_node_to_{alloc|free}.
+ * return next node in node_online_map, wrapping at end.
+ */
+static int next_node_allowed(int nid)
+{
+	nid = next_node(nid, node_online_map);
+	if (nid == MAX_NUMNODES)
+		nid = first_node(node_online_map);
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+
+	return nid;
+}
+
+/*
  * Use a helper variable to find the next node and then
  * copy it back to next_nid_to_alloc afterwards:
  * otherwise there's a window in which a racer might
@@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag
  */
 static int hstate_next_node_to_alloc(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_alloc;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_alloc = next_nid;
-	return next_nid;
+	return nid;
 }
 
 static int alloc_fresh_huge_page(struct hstate *h)
@@ -649,15 +663,17 @@ static int alloc_fresh_huge_page(struct
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_alloc;
+	start_nid = hstate_next_node_to_alloc(h);
 	next_nid = start_nid;
 
 	do {
 		page = alloc_fresh_huge_page_node(h, next_nid);
-		if (page)
+		if (page) {
 			ret = 1;
+			break;
+		}
 		next_nid = hstate_next_node_to_alloc(h);
-	} while (!page && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -668,17 +684,19 @@ static int alloc_fresh_huge_page(struct
 }
 
 /*
- * helper for free_pool_huge_page() - find next node
- * from which to free a huge page
+ * helper for free_pool_huge_page() - return the next node
+ * from which to free a huge page.  Advance the next node id
+ * whether or not we find a free huge page to free so that the
+ * next attempt to free addresses the next node.
  */
 static int hstate_next_node_to_free(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_free, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_free;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_free = next_nid;
-	return next_nid;
+	return nid;
 }
 
 /*
@@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_free;
+	start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
@@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
 			}
 			update_and_free_page(h, page);
 			ret = 1;
+			break;
 		}
 		next_nid = hstate_next_node_to_free(h);
-	} while (!ret && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	return ret;
 }
@@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(h->next_nid_to_alloc),
+				NODE_DATA(hstate_next_node_to_alloc(h)),
 				huge_page_size(h), huge_page_size(h), 0);
 
-		hstate_next_node_to_alloc(h);
 		if (addr) {
 			/*
 			 * Use the beginning of the huge page to store the
@@ -1167,29 +1185,31 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = h->next_nid_to_alloc;
+		start_nid = hstate_next_node_to_alloc(h);
 	else
-		start_nid = h->next_nid_to_free;
+		start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
 		int nid = next_nid;
 		if (delta < 0)  {
-			next_nid = hstate_next_node_to_alloc(h);
 			/*
 			 * To shrink on this node, there must be a surplus page
 			 */
-			if (!h->surplus_huge_pages_node[nid])
+			if (!h->surplus_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_alloc(h);
 				continue;
+			}
 		}
 		if (delta > 0) {
-			next_nid = hstate_next_node_to_free(h);
 			/*
 			 * Surplus cannot exceed the total number of pages
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
+						h->nr_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_free(h);
 				continue;
+			}
 		}
 
 		h->surplus_huge_pages += delta;


* [PATCH 2/5] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
From: Lee Schermerhorn @ 2009-08-24 19:26 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns

Against: 2.6.31-rc6-mmotm-090820-1918

V3:
+ moved this patch to after the "rework" of hstate_next_node_to_...
  functions as this patch is more specific to using task mempolicy
  to control huge page allocation and freeing.

In preparation for constraining huge page allocation and freeing by the
controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
to the allocate, free and surplus adjustment functions.  For now, pass
NULL to indicate default behavior--i.e., use node_online_map.  A
subsequent patch will derive a non-default mask from the controlling
task's numa mempolicy.
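
A minimal sketch of the convention, condensed from the hunks below:

	static int hstate_next_node_to_alloc(struct hstate *h,
						nodemask_t *nodes_allowed)
	{
		if (!nodes_allowed)
			nodes_allowed = &node_online_map;	/* default */
		...
	}

Call sites that want the old behavior simply pass NULL--e.g.,
alloc_fresh_huge_page(h, NULL).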

Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |  102 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 67 insertions(+), 35 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:46.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
@@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
- * common helper function for hstate_next_node_to_{alloc|free}.
- * return next node in node_online_map, wrapping at end.
+ * common helper functions for hstate_next_node_to_{alloc|free}.
+ * We may have allocated or freed a huge page based on a different
+ * nodes_allowed, previously, so h->next_node_to_{alloc|free} might
+ * be outside of *nodes_allowed.  Ensure that we use the next
+ * allowed node for alloc or free.
  */
-static int next_node_allowed(int nid)
+static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
 {
-	nid = next_node(nid, node_online_map);
+	nid = next_node(nid, *nodes_allowed);
 	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+		nid = first_node(*nodes_allowed);
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 
 	return nid;
 }
 
+static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	if (!node_isset(nid, *nodes_allowed))
+		nid = next_node_allowed(nid, nodes_allowed);
+	return nid;
+}
+
 /*
  * Use a helper variable to find the next node and then
  * copy it back to next_nid_to_alloc afterwards:
@@ -642,28 +652,34 @@ static int next_node_allowed(int nid)
  * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
  * doesn't matter if occasionally a racer chooses the
- * same nid as we do.  Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
+ * same nid as we do.  Move nid forward in the mask whether
+ * or not we just successfully allocated a hugepage so that
+ * the next allocation addresses the next node.
  */
-static int hstate_next_node_to_alloc(struct hstate *h)
+static int hstate_next_node_to_alloc(struct hstate *h,
+					nodemask_t *nodes_allowed)
 {
 	int nid, next_nid;
 
-	nid = h->next_nid_to_alloc;
-	next_nid = next_node_allowed(nid);
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
+	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
+
+	next_nid = next_node_allowed(nid, nodes_allowed);
 	h->next_nid_to_alloc = next_nid;
+
 	return nid;
 }
 
-static int alloc_fresh_huge_page(struct hstate *h)
+static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 {
 	struct page *page;
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hstate_next_node_to_alloc(h);
+	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct
 			ret = 1;
 			break;
 		}
-		next_nid = hstate_next_node_to_alloc(h);
+		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	} while (next_nid != start_nid);
 
 	if (ret)
@@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct
  * whether or not we find a free huge page to free so that the
  * next attempt to free addresses the next node.
  */
-static int hstate_next_node_to_free(struct hstate *h)
+static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 {
 	int nid, next_nid;
 
-	nid = h->next_nid_to_free;
-	next_nid = next_node_allowed(nid);
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
+	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
+
+	next_nid = next_node_allowed(nid, nodes_allowed);
 	h->next_nid_to_free = next_nid;
+
 	return nid;
 }
 
@@ -705,13 +726,14 @@ static int hstate_next_node_to_free(stru
  * balanced over allowed nodes.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, bool acct_surplus)
+static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+							 bool acct_surplus)
 {
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hstate_next_node_to_free(h);
+	start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -735,7 +757,7 @@ static int free_pool_huge_page(struct hs
 			ret = 1;
 			break;
 		}
-		next_nid = hstate_next_node_to_free(h);
+		next_nid = hstate_next_node_to_free(h, nodes_allowed);
 	} while (next_nid != start_nid);
 
 	return ret;
@@ -937,7 +959,7 @@ static void return_unused_surplus_pages(
 	 * on-line nodes for us and will handle the hstate accounting.
 	 */
 	while (nr_pages--) {
-		if (!free_pool_huge_page(h, 1))
+		if (!free_pool_huge_page(h, NULL, 1))
 			break;
 	}
 }
@@ -1047,7 +1069,7 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(hstate_next_node_to_alloc(h)),
+				NODE_DATA(hstate_next_node_to_alloc(h, NULL)),
 				huge_page_size(h), huge_page_size(h), 0);
 
 		if (addr) {
@@ -1102,7 +1124,7 @@ static void __init hugetlb_hstate_alloc_
 		if (h->order >= MAX_ORDER) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h))
+		} else if (!alloc_fresh_huge_page(h, NULL))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -1144,16 +1166,22 @@ static void __init report_hugepages(void
 }
 
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(struct hstate *h, unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count,
+						nodemask_t *nodes_allowed)
 {
 	int i;
 
 	if (h->order >= MAX_ORDER)
 		return;
 
+	if (!nodes_allowed)
+		nodes_allowed = &node_online_map;
+
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
+		if (!node_isset(i, *nodes_allowed))
+			continue;
 		list_for_each_entry_safe(page, next, freel, lru) {
 			if (count >= h->nr_huge_pages)
 				return;
@@ -1167,7 +1195,8 @@ static void try_to_free_low(struct hstat
 	}
 }
 #else
-static inline void try_to_free_low(struct hstate *h, unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count,
+						nodemask_t *nodes_allowed)
 {
 }
 #endif
@@ -1177,7 +1206,8 @@ static inline void try_to_free_low(struc
  * balanced by operating on them in a round-robin fashion.
  * Returns 1 if an adjustment was made.
  */
-static int adjust_pool_surplus(struct hstate *h, int delta)
+static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
+				int delta)
 {
 	int start_nid, next_nid;
 	int ret = 0;
@@ -1185,9 +1215,9 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = hstate_next_node_to_alloc(h);
+		start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	else
-		start_nid = hstate_next_node_to_free(h);
+		start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -1197,7 +1227,8 @@ static int adjust_pool_surplus(struct hs
 			 * To shrink on this node, there must be a surplus page
 			 */
 			if (!h->surplus_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_alloc(h);
+				next_nid = hstate_next_node_to_alloc(h,
+								nodes_allowed);
 				continue;
 			}
 		}
@@ -1207,7 +1238,8 @@ static int adjust_pool_surplus(struct hs
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
 						h->nr_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_free(h);
+				next_nid = hstate_next_node_to_free(h,
+								nodes_allowed);
 				continue;
 			}
 		}
@@ -1242,7 +1274,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, -1))
+		if (!adjust_pool_surplus(h, NULL, -1))
 			break;
 	}
 
@@ -1253,7 +1285,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h);
+		ret = alloc_fresh_huge_page(h, NULL);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1277,13 +1309,13 @@ static unsigned long set_max_huge_pages(
 	 */
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count);
+	try_to_free_low(h, min_count, NULL);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, 0))
+		if (!free_pool_huge_page(h, NULL, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, 1))
+		if (!adjust_pool_surplus(h, NULL, 1))
 			break;
 	}
 out:


* [PATCH 3/5] hugetlb:  derive huge pages nodes allowed from task mempolicy
From: Lee Schermerhorn @ 2009-08-24 19:27 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

[PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy

Against: 2.6.31-rc6-mmotm-090820-1918

V2:
+ cleaned up comments: removed some deemed unnecessary,
  added some suggested by review
+ removed check for !current in huge_mpol_nodes_allowed().
+ added 'current->comm' to warning message in huge_mpol_nodes_allowed().
+ added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
  catch out of range node id.
+ added examples to patch description

V3: Factored this patch from V2 patch 2/3

V4: added back missing "kfree(nodes_allowed)" in set_max_huge_pages()

This patch derives a "nodes_allowed" node mask from the numa
mempolicy of the task modifying the number of persistent huge
pages to control the allocation, freeing and adjusting of surplus
huge pages.  This mask is derived as follows (an illustration follows
the list):

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced. 
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".
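
For instance, with this derivation a task running under
"numactl --membind=1,3" yields nodes_allowed = {1,3}, one running
under "numactl --preferred=2" yields {2}, and a task with default
policy yields NULL, which the hugetlb subsystem treats as
node_online_map.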

Notes:

1) This patch introduces a subtle change in behavior:  huge page
   allocation and freeing will be constrained by any mempolicy
   that the task adjusting the huge page pool inherits from its
   parent.  This policy could come from a distant ancestor.  The
   administrator adjusting the huge page pool without explicitly
   specifying a mempolicy via numactl might be surprised by this.
   Additionally, any mempolicy specified by numactl will be
   constrained by the cpuset in which numactl is invoked.

2) Hugepages allocated at boot time use the node_online_map.
   An additional patch could implement a temporary boot time
   huge pages nodes_allowed command line parameter.

3) Using mempolicy to control persistent huge page allocation
   and freeing requires no change to hugeadm when invoking
   it via numactl, as shown in the examples below.  However,
   hugeadm could be enhanced to take the allowed nodes as an
   argument and set its task mempolicy itself.  This would allow
   it to detect and warn about any non-default mempolicy that it
   inherited from its parent, thus alleviating the issue described
   in Note 1 above.

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	hugeadm --pool-pages-min=2048Kb:32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the 
condition above, with 8 huge pages per node:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.
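
The resulting per node distribution may be verified at any time with:

	cat /sys/devices/system/node/node*/meminfo | fgrep Huge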

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    3 ++
 mm/hugetlb.c              |   14 ++++++----
 mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 5 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c	2009-08-24 12:12:53.000000000 -0400
@@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
 	}
 	return zl;
 }
+
+/*
+ * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
+ *
+ * Returns a [pointer to a] nodelist based on the current task's mempolicy
+ * to constrain the allocation and freeing of persistent huge pages.
+ * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
+ * 'bind' policy in this context.  An attempt to allocate a persistent huge
+ * page will never "fallback" to another node inside the buddy system
+ * allocator.
+ *
+ * If the task's mempolicy is "default" [NULL], just return NULL for
+ * default behavior.  Otherwise, extract the policy nodemask for 'bind'
+ * or 'interleave' policy or construct a nodemask for 'preferred' or
+ * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
+ *
+ * N.B., it is the caller's responsibility to free a returned nodemask.
+ */
+nodemask_t *huge_mpol_nodes_allowed(void)
+{
+	nodemask_t *nodes_allowed = NULL;
+	struct mempolicy *mempolicy;
+	int nid;
+
+	if (!current->mempolicy)
+		return NULL;
+
+	mpol_get(current->mempolicy);
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.\nFalling back to default.\n",
+			current->comm);
+		goto out;
+	}
+	nodes_clear(*nodes_allowed);
+
+	mempolicy = current->mempolicy;
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			nid = numa_node_id();
+		else
+			nid = mempolicy->v.preferred_node;
+		node_set(nid, *nodes_allowed);
+		break;
+
+	case MPOL_BIND:
+		/* Fall through */
+	case MPOL_INTERLEAVE:
+		*nodes_allowed =  mempolicy->v.nodes;
+		break;
+
+	default:
+		BUG();
+	}
+
+out:
+	mpol_put(current->mempolicy);
+	return nodes_allowed;
+}
 #endif
 
 /* Allocate a page in interleaved policy.
Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h	2009-08-24 12:12:53.000000000 -0400
@@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
+extern nodemask_t *huge_mpol_nodes_allowed(void);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
 	return node_zonelist(0, gfp_flags);
 }
 
+static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
@@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
+	nodemask_t *nodes_allowed;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
+	nodes_allowed = huge_mpol_nodes_allowed();
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1274,7 +1277,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, -1))
+		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
 	}
 
@@ -1285,7 +1288,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h, NULL);
+		ret = alloc_fresh_huge_page(h, nodes_allowed);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1309,18 +1312,19 @@ static unsigned long set_max_huge_pages(
 	 */
 	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count, NULL);
+	try_to_free_low(h, min_count, nodes_allowed);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, NULL, 0))
+		if (!free_pool_huge_page(h, nodes_allowed, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, 1))
+		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
 	}
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	kfree(nodes_allowed);
 	return ret;
 }
 


* [PATCH 4/5] hugetlb:  add per node hstate attributes
From: Lee Schermerhorn @ 2009-08-24 19:29 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

PATCH/RFC 4/5 hugetlb: register per node hugepages attributes

Against: 2.6.31-rc6-mmotm-090820-1918

V2:  remove dependency on kobject private bitfield.  Search
     global hstates then all per node hstates for kobject
     match in attribute show/store functions.

V3:  rebase atop the mempolicy-based hugepage alloc/free;
     use custom "nodes_allowed" to restrict alloc/free to
     a specific node via per node attributes.  Per node
     attribute overrides mempolicy.  I.e., mempolicy only
     applies to global attributes.

This patch demonstrates the feasibility--if not the advisability--of
supporting both mempolicy-based persistent huge page management and
per node "override" attributes.

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_hugepages     - r/o
	surplus_hugepages  - r/o

The patch attempts to re-use/share as much of the existing
global hstate attribute initialization and handling, and the
"nodes_allowed" constraint processing as possible.
In set_max_huge_pages(), a node id < 0 indicates a change to
global hstate parameters.  In this case, any non-default task
mempolicy will be used to generate the nodes_allowed mask.  A
node id >= 0 indicates a node specific update and the count
argument specifies the target count for that node.  From this
info, we compute the target global count for the hstate and
construct a nodes_allowed node mask containing only the specified
node.  Thus, setting the node specific nr_hugepages via the
per node attribute effectively overrides any task mempolicy.
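
For example, if an hstate currently has 20 huge pages globally, 4 of
them on node 2, then writing 16 to node 2's per node nr_hugepages
yields a global target count of 16 + 20 - 4 = 32, with the +12 delta
constrained to node 2 by the single-node nodes_allowed mask.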


Issue:  dependency of the base [node] driver on the hugetlbfs module.
We want to keep all of the hstate attribute registration and handling
in the hugetlb module.  However, we need to call into this code to
register the per node hstate attributes on node hot plug.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
	numactl -m 2 hugeadm --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c     |    2 
 include/linux/hugetlb.h |    6 +
 include/linux/node.h    |    3 
 mm/hugetlb.c            |  213 +++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 197 insertions(+), 27 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c	2009-08-24 12:12:56.000000000 -0400
@@ -200,6 +200,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +221,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h	2009-08-24 12:12:56.000000000 -0400
@@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
 	return size_to_hstate(PAGE_SIZE << compound_order(page));
 }
 
+struct node;
+extern void hugetlb_register_node(struct node *);
+extern void hugetlb_unregister_node(struct node *);
+
 #else
 struct hstate {};
 #define alloc_bootmem_huge_page(h) NULL
@@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
 {
 	return 1;
 }
+#define hugetlb_register_node(NP)
+#define hugetlb_unregister_node(NP)
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:56.000000000 -0400
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/node.h>
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs
 	return ret;
 }
 
+static nodemask_t *nodes_allowed_from_node(int nid)
+{
+	nodemask_t *nodes_allowed;
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.\nFalling back to default.\n",
+			current->comm);
+	} else {
+		nodes_clear(*nodes_allowed);
+		node_set(nid, *nodes_allowed);
+	}
+	return nodes_allowed;
+}
+
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+								int nid)
 {
 	unsigned long min_count, ret;
 	nodemask_t *nodes_allowed;
@@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	nodes_allowed = huge_mpol_nodes_allowed();
+	if (nid < 0)
+		nodes_allowed = huge_mpol_nodes_allowed();
+	else {
+		/*
+		 * incoming 'count' is for node 'nid' only, so
+		 * adjust count to global, but restrict alloc/free
+		 * to the specified node.
+		 */
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		nodes_allowed = nodes_allowed_from_node(nid);
+	}
 
 	/*
 	 * Increase the pool size
@@ -1338,34 +1365,69 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
-static struct hstate *kobj_to_hstate(struct kobject *kobj)
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		int hi;
+		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
+			if (node->hstate_kobjs[hi] == kobj) {
+				if (nidp)
+					*nidp = nid;
+				return &hstates[hi];
+			}
+	}
+
+	BUG();
+	return NULL;
+}
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
 {
 	int i;
+
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
+		if (hstate_kobjs[i] == kobj) {
+			if (nidp)
+				*nidp = -1;
 			return &hstates[i];
-	BUG();
-	return NULL;
+		}
+
+	return kobj_to_node_hstate(kobj, nidp);
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	struct hstate *h;
+	unsigned long nr_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
-	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h;
+	int nid;
+	int err;
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h = kobj_to_hstate(kobj, &nid);
+	h->max_huge_pages = set_max_huge_pages(h, input, nid);
 
 	return count;
 }
@@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
 	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
@@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	struct hstate *h;
+	unsigned long free_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
 static ssize_t resv_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 	return sprintf(buf, "%lu\n", h->resv_huge_pages);
 }
 HSTATE_ATTR_RO(resv_hugepages);
@@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
 static ssize_t surplus_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	struct hstate *h;
+	unsigned long surplus_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		kobject_put(node->hstate_kobjs[h - hstates]);
+		node->hstate_kobjs[h - hstates] = NULL;
+	}
+
+	kobject_put(node->hugepages_kobj);
+	node->hugepages_kobj = NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+}
+
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	int err;
+
+	if (!hugepages_kobj)
+		return;		/* too early */
+
+	node->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!node->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
+						node->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err)
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+	}
+}
+
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid && !node->hugepages_kobj)
+			hugetlb_register_node(node);
+	}
+}
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
 
 	return 0;
 }
Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
@@ -21,9 +21,12 @@
 
 #include <linux/sysdev.h>
 #include <linux/cpumask.h>
+#include <linux/hugetlb.h>
 
 struct node {
 	struct sys_device	sysdev;
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
 };
 
 struct memory_block;


* [PATCH 5/5] hugetlb:  update hugetlb documentation for mempolicy based management.
From: Lee Schermerhorn @ 2009-08-24 19:30 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

PATCH 5/5 hugetlb: update hugetlb documentation for mempolicy based management.

Against: 2.6.31-rc6-mmotm-090820-1918

V2:  Add brief description of per node attributes.

This patch updates the kernel huge tlb documentation to describe the
numa memory policy based huge page management.  Additionally, the patch
includes a fair amount of rework to improve consistency, eliminate
duplication and set the context for documenting the memory policy
interaction.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/hugetlbpage.txt |  257 ++++++++++++++++++++++++++-------------
 1 file changed, 172 insertions(+), 85 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/Documentation/vm/hugetlbpage.txt	2009-08-24 12:12:44.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/Documentation/vm/hugetlbpage.txt	2009-08-24 12:50:49.000000000 -0400
@@ -11,23 +11,21 @@ This optimization is more critical now a
 (several GBs) are more readily available.
 
 Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSv shared memory system calls (shmget, shmat).
+system call or standard SYSV shared memory system calls (shmget, shmat).
 
 First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 (present under "File systems") and CONFIG_HUGETLB_PAGE (selected
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with huge page support should show the number of configured
-huge pages in the system by running the "cat /proc/meminfo" command.
+The /proc/meminfo file provides information about the total number of hugetlb
+pages preallocated in the kernel's huge page pool.  It also displays
+information about the number of free, reserved and surplus huge pages and the
+[default] huge page size.  The huge page size is needed for generating the
+proper alignment and size of the arguments to system calls that map huge page
+regions.
 
-/proc/meminfo also provides information about the total number of hugetlb
-pages configured in the kernel.  It also displays information about the
-number of free hugetlb pages at any time.  It also displays information about
-the configured huge page size - this is needed for generating the proper
-alignment and size of the arguments to the above system calls.
-
-The output of "cat /proc/meminfo" will have lines like:
+The output of "cat /proc/meminfo" will include lines like:
 
 .....
 HugePages_Total: vvv
@@ -53,26 +51,25 @@ HugePages_Surp  is short for "surplus,"
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) huge pages.
-The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of huge pages is
-possible only if there are enough hugetlb pages free that can be transferred
-back to regular memory pool).
-
-Pages that are used as hugetlb pages are reserved inside the kernel and cannot
-be used for other purposes.
-
-Once the kernel with Hugetlb page support is built and running, a user can
-use either the mmap system call or shared memory system calls to start using
-the huge pages.  It is required that the system administrator preallocate
-enough memory for huge page purposes.
-
-The administrator can preallocate huge pages on the kernel boot command line by
-specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
-requested.  This is the most reliable method for preallocating huge pages as
-memory has not yet become fragmented.
+/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
+allocated in the kernel's huge page pool.  These are called "persistent"
+huge pages.  A user with root privileges can dynamically allocate more or
+free some persistent huge pages by increasing or decreasing the value of
+'nr_hugepages'.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes.  Huge pages can not be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages.  See the discussion of
+Using Huge Pages, below.
+
+The administrator can preallocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested.  This is the most reliable method
+of preallocating huge pages as memory has not yet become fragmented.
 
 Some platforms support multiple huge page sizes.  To preallocate huge pages
 of a specific size, one must preceed the huge pages boot command parameters
@@ -80,19 +77,24 @@ with a huge page size selection paramete
 be specified in bytes with optional scale suffix [kKmMgG].  The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured [default
-size] hugetlb pages in the kernel.  Super user can dynamically request more
-(or free some pre-configured) huge pages.
-
-Use the following command to dynamically allocate/deallocate default sized
-huge pages:
+When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 default sized huge pages in the system.
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
 On a NUMA platform, the kernel will attempt to distribute the huge page pool
-over the all on-line nodes.  These huge pages, allocated when nr_hugepages
-is increased, are called "persistent huge pages".
+over all the nodes specified by the NUMA memory policy of the task that
+modifies nr_hugepages which contain sufficient available contiguous memory.
+These nodes are called the huge pages "allowed nodes".  The default for the
+huge pages allowed nodes--when the task has default memory policy--is all
+on-line nodes.  See the discussion below of the interaction of task memory
+policy, cpusets and per node attributes with the allocation and freeing of
+persistent huge pages.
 
 The success or failure of huge page allocation depends on the amount of
 physically contiguous memory that is preset in system at the time of the
@@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any.
 
-System administrators may want to put this command in one of the local rc init
-files.  This will enable the kernel to request huge pages early in the boot
-process when the possibility of getting physical contiguous pages is still
-very high.  Administrators can verify the number of huge pages actually
-allocated by checking the sysctl or meminfo.  To check the per node
+System administrators may want to put this command in one of the local rc
+init files.  This will enable the kernel to preallocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high.  Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo.  To check the per node
 distribution of huge pages in a NUMA system, use:
 
 	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
@@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
 /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
 huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
 requested by applications.  Writing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
-huge pages from the buddy allocator, when the normal pool is exhausted. As
-these surplus huge pages go out of use, they are freed back to the buddy
-allocator.
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
 
-When increasing the huge page pool size via nr_hugepages, any surplus
+When increasing the huge page pool size via nr_hugepages, any existing surplus
 pages will first be promoted to persistent huge pages.  Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
-the new huge page pool size.
+the new persistent huge page pool size.
 
 The administrator may shrink the pool of preallocated huge pages for
 the default huge page size by setting the nr_hugepages sysctl to a
 smaller value.  The kernel will attempt to balance the freeing of huge pages
-across all on-line nodes.  Any free huge pages on the selected nodes will
-be freed back to the buddy allocator.
-
-Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of huge pages in use will convert the balance to surplus
-huge pages even if it would exceed the overcommit value.  As long as
-this condition holds, however, no more surplus huge pages will be
-allowed on the system until one of the two sysctls are increased
-sufficiently, or the surplus huge pages go out of use and are freed.
+across all nodes in the memory policy of the task modifying nr_hugepages.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages.  This will occur even if
+the number of surplus pages would exceed the overcommit value.  As long as
+this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
 
 With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface has been duplicated in sysfs. The above
-information applies to the default huge page size which will be
-controlled by the /proc interfaces for backwards compatibility. The root
-huge page control directory in sysfs is:
+the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
+The /proc interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is:
 
 	/sys/kernel/mm/hugepages
 
 For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form
+will exist, of the form:
 
 	hugepages-${size}kB
 
@@ -159,6 +162,98 @@ Inside each of these directories, the sa
 
 which function as described above for the default huge page-sized case.
 
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
+
+Whether huge pages are allocated and freed via the /proc interface or
+the /sysfs interface, the NUMA nodes from which huge pages are allocated
+or freed are controlled by the NUMA memory policy of the task that modifies
+the nr_hugepages parameter.  [nr_overcommit_hugepages is a global limit.]
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the nr_hugepages example above, is:
+
+    numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+or, more succinctly:
+
+    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
+
+This will allocate or free abs(20 - nr_hugepages) huge pages to or from
+the nodes specified in <node-list>, depending on whether nr_hugepages is
+initially less than or greater than 20, respectively.  No huge pages will
+be allocated or freed on any node not included in the specified <node-list>.
+
+Any memory policy mode--bind, preferred, local or interleave--may be
+used.  The effect on persistent huge page allocation will be as follows:
+
+1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+   persistent huge pages will be distributed across the node or nodes
+   specified in the mempolicy as if "interleave" had been specified.
+   However, if a node in the policy does not contain sufficient contiguous
+   memory for a huge page, the allocation will not "fallback" to the nearest
+   neighbor node with sufficient contiguous memory.  To do this would cause
+   undesirable imbalance in the distribution of the huge page pool, or
+   possibly, allocation of persistent huge pages on nodes not allowed by
+   the task's memory policy.
+
+2) One or more nodes may be specified with the bind or interleave policy.
+   If more than one node is specified with the preferred policy, only the
+   lowest numeric id will be used.  Local policy will select the node where
+   the task is running at the time the nodes_allowed mask is constructed.
+
+3) For local policy to be deterministic, the task must be bound to a cpu or
+   cpus in a single node.  Otherwise, the task could be migrated to some
+   other node at any time after launch and the resulting node will be
+   indeterminate.  Thus, local policy is not very useful for this purpose.
+   Any of the other mempolicy modes may be used to specify a single node.
+
+4) The nodes allowed mask will be derived from any non-default task mempolicy,
+   whether this policy was set explicitly by the task itself or one of its
+   ancestors, such as numactl.  This means that if the task is invoked from a
+   shell with non-default policy, that policy will be used.  One can specify a
+   node list of "all" with numactl --interleave or --membind [-m] to achieve
+   interleaving over all nodes in the system or cpuset.
+
+5) Any task mempolicy specified--e.g., using numactl--will be constrained by
+   the resource limits of any cpuset in which the task runs.  Thus, there will
+   be no way for a task with non-default policy running in a cpuset with a
+   subset of the system nodes to allocate huge pages outside the cpuset
+   without first moving to a cpuset that contains all of the desired nodes.
+
+6) Hugepages allocated at boot time always use the node_online_map.
+
+
+Per Node Hugepages Attributes
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, has been replicated under each "node" system device in:
+
+	/sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files:
+
+	nr_hugepages
+	free_hugepages
+	surplus_hugepages
+
+The 'free_' and 'surplus_' attribute files are read-only.  They return the number
+of free and surplus [overcommitted] huge pages, respectively, on the parent
+node.
+
+The nr_hugepages attribute will return the total number of huge pages on the
+specified node.  When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
+
+Note that the numbers of overcommit and reserve pages remain global
+quantities, as we don't know until fault time, when the faulting task's
+mempolicy is applied, from which node the huge page allocation will be attempted.
+
+
+Using Huge Pages:
+
 If the user applications are going to request huge pages using mmap system
 call, then it is required that system administrator mount a file system of
 type hugetlbfs:
@@ -204,9 +299,11 @@ mount of filesystem will be required for
  * requesting huge pages.
  *
  * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages.  That means the addresses starting with 0x800000... will need
- * to be specified.  Specifying a fixed address is not required on ppc64,
- * i386 or x86_64.
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64, are not so constrained.
  *
  * Note: The default shared memory limit is quite low on many kernels,
  * you may need to increase it via:
@@ -235,14 +332,8 @@ mount of filesystem will be required for
 
 #define dprintf(x)  printf(x)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define SHMAT_FLAGS (0)
-#endif
 
 int main(void)
 {
@@ -300,10 +391,12 @@ int main(void)
  * example, the app is requesting memory of size 256MB that is backed by
  * huge pages.
  *
- * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
- * That means the addresses starting with 0x800000... will need to be
- * specified.  Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages.  That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required.  If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64, are not so constrained.
  */
 #include <stdlib.h>
 #include <stdio.h>
@@ -315,14 +408,8 @@ int main(void)
 #define LENGTH (256UL*1024*1024)
 #define PROTECTION (PROT_READ | PROT_WRITE)
 
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
+#define ADDR (void *)(0x0UL)	/* let kernel choose address */
 #define FLAGS (MAP_SHARED)
-#endif
 
 void check_bytes(char *addr)
 {

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] hugetlb:  rework hstate_next_node_* functions
  2009-08-24 19:25 ` [PATCH 1/5] hugetlb: rework hstate_next_node_* functions Lee Schermerhorn
@ 2009-08-25  8:10     ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2009-08-25  8:10 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, Andrew Morton, Mel Gorman,
	Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney

On Mon, 24 Aug 2009, Lee Schermerhorn wrote:

> [PATCH 1/5] hugetlb:  rework hstate_next_node* functions
> 
> Against: 2.6.31-rc6-mmotm-090820-1918
> 
> V2:
> + cleaned up comments, removed some deemed unnecessary,
>   add some suggested by review
> + removed check for !current in huge_mpol_nodes_allowed().
> + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
> + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
>   catch out of range node id.
> + add examples to patch description
> 
> V3:
> + factored this "cleanup" patch out of V2 patch 2/3
> + moved ahead of patch to add nodes_allowed mask to alloc funcs
>   as this patch is somewhat independent from using task mempolicy
>   to control huge page allocation and freeing.
> 
> Modify the hstate_next_node* functions to allow them to be called to
> obtain the "start_nid".  Then, whereas prior to this patch we
> unconditionally called hstate_next_node_to_{alloc|free}(), whether
> or not we successfully allocated/freed a huge page on the node,
> now we only call these functions on failure to alloc/free to advance
> to next allowed node.
> 
> Factor out the next_node_allowed() function to handle wrap at end
> of node_online_map.  In this version, the allowed nodes include all 
> of the online nodes.
> 
> Reviewed-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
  2009-08-24 19:26 ` [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Lee Schermerhorn
@ 2009-08-25  8:16     ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2009-08-25  8:16 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Mon, 24 Aug 2009, Lee Schermerhorn wrote:

> [PATCH 2/4] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
> 
> Against: 2.6.31-rc6-mmotm-090820-1918
> 
> V3:
> + moved this patch to after the "rework" of hstate_next_node_to_...
>   functions as this patch is more specific to using task mempolicy
>   to control huge page allocation and freeing.
> 
> In preparation for constraining huge page allocation and freeing by the
> controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
> to the allocate, free and surplus adjustment functions.  For now, pass
> NULL to indicate default behavior--i.e., use node_online_map.  A
> subsequent patch will derive a non-default mask from the controlling
> task's numa mempolicy.
> 
> Reviewed-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  mm/hugetlb.c |  102 ++++++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 67 insertions(+), 35 deletions(-)
> 
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:46.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag
>  }
>  
>  /*
> - * common helper function for hstate_next_node_to_{alloc|free}.
> - * return next node in node_online_map, wrapping at end.
> + * common helper functions for hstate_next_node_to_{alloc|free}.
> + * We may have allocated or freed huge pages based on a different
> + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might
> + * be outside of *nodes_allowed.  Ensure that we use the next
> + * allowed node for alloc or free.
>   */
> -static int next_node_allowed(int nid)
> +static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
>  {
> -	nid = next_node(nid, node_online_map);
> +	nid = next_node(nid, *nodes_allowed);
>  	if (nid == MAX_NUMNODES)
> -		nid = first_node(node_online_map);
> +		nid = first_node(*nodes_allowed);
>  	VM_BUG_ON(nid >= MAX_NUMNODES);
>  
>  	return nid;
>  }
>  
> +static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
> +{
> +	if (!node_isset(nid, *nodes_allowed))
> +		nid = next_node_allowed(nid, nodes_allowed);
> +	return nid;
> +}

Awkward name considering this doesn't simply return true or false as
expected; it returns a nid.
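
For illustration, a minimal sketch of the same helper under a name that
matches the return value (the name is hypothetical, not from the patch):

	/* return nid itself if allowed, else the next allowed node */
	static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
	{
		if (!node_isset(nid, *nodes_allowed))
			nid = next_node_allowed(nid, nodes_allowed);
		return nid;
	}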

> +
>  /*
>   * Use a helper variable to find the next node and then
>   * copy it back to next_nid_to_alloc afterwards:
> @@ -642,28 +652,34 @@ static int next_node_allowed(int nid)
>   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
>   * But we don't need to use a spin_lock here: it really
>   * doesn't matter if occasionally a racer chooses the
> - * same nid as we do.  Move nid forward in the mask even
> - * if we just successfully allocated a hugepage so that
> - * the next caller gets hugepages on the next node.
> + * same nid as we do.  Move nid forward in the mask whether
> + * or not we just successfully allocated a hugepage so that
> + * the next allocation addresses the next node.
>   */
> -static int hstate_next_node_to_alloc(struct hstate *h)
> +static int hstate_next_node_to_alloc(struct hstate *h,
> +					nodemask_t *nodes_allowed)
>  {
>  	int nid, next_nid;
>  
> -	nid = h->next_nid_to_alloc;
> -	next_nid = next_node_allowed(nid);
> +	if (!nodes_allowed)
> +		nodes_allowed = &node_online_map;
> +
> +	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> +
> +	next_nid = next_node_allowed(nid, nodes_allowed);
>  	h->next_nid_to_alloc = next_nid;
> +
>  	return nid;
>  }

Don't need next_nid.
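
A minimal sketch of the allocation-side helper with the helper variable
dropped, assuming the wrap handling in next_node_allowed() means a valid
nid can be stored back directly:

	static int hstate_next_node_to_alloc(struct hstate *h,
						nodemask_t *nodes_allowed)
	{
		int nid;

		if (!nodes_allowed)
			nodes_allowed = &node_online_map;

		nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
		h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

		return nid;
	}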

> -static int alloc_fresh_huge_page(struct hstate *h)
> +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
>  {
>  	struct page *page;
>  	int start_nid;
>  	int next_nid;
>  	int ret = 0;
>  
> -	start_nid = hstate_next_node_to_alloc(h);
> +	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
>  	next_nid = start_nid;
>  
>  	do {
> @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct
>  			ret = 1;
>  			break;
>  		}
> -		next_nid = hstate_next_node_to_alloc(h);
> +		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
>  	} while (next_nid != start_nid);
>  
>  	if (ret)
> @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct
>   * whether or not we find a free huge page to free so that the
>   * next attempt to free addresses the next node.
>   */
> -static int hstate_next_node_to_free(struct hstate *h)
> +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  {
>  	int nid, next_nid;
>  
> -	nid = h->next_nid_to_free;
> -	next_nid = next_node_allowed(nid);
> +	if (!nodes_allowed)
> +		nodes_allowed = &node_online_map;
> +
> +	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
> +
> +	next_nid = next_node_allowed(nid, nodes_allowed);
>  	h->next_nid_to_free = next_nid;
> +
>  	return nid;
>  }

Same.
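
The free-side helper reduces the same way; a sketch under the same
assumption:

	static int hstate_next_node_to_free(struct hstate *h,
						nodemask_t *nodes_allowed)
	{
		int nid;

		if (!nodes_allowed)
			nodes_allowed = &node_online_map;

		nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
		h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);

		return nid;
	}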

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-08-24 19:27 ` [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Lee Schermerhorn
@ 2009-08-25  8:47     ` David Rientjes
  2009-08-25 10:22     ` Mel Gorman
  1 sibling, 0 replies; 51+ messages in thread
From: David Rientjes @ 2009-08-25  8:47 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Mon, 24 Aug 2009, Lee Schermerhorn wrote:

> This patch derives a "nodes_allowed" node mask from the numa
> mempolicy of the task modifying the number of persistent huge
> pages to control the allocation, freeing and adjusting of surplus
> huge pages.  This mask is derived as follows:
> 
> * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
>   is produced.  This will cause the hugetlb subsystem to use
>   node_online_map as the "nodes_allowed".  This preserves the
>   behavior before this patch.
> * For "preferred" mempolicy, including explicit local allocation,
>   a nodemask with the single preferred node will be produced. 
>   "local" policy will NOT track any internode migrations of the
>   task adjusting nr_hugepages.
> * For "bind" and "interleave" policy, the mempolicy's nodemask
>   will be used.
> * Other than to inform the construction of the nodes_allowed node
>   mask, the actual mempolicy mode is ignored.  That is, all modes
>   behave like interleave over the resulting nodes_allowed mask
>   with no "fallback".
> 
> Notes:
> 
> 1) This patch introduces a subtle change in behavior:  huge page
>    allocation and freeing will be constrained by any mempolicy
>    that the task adjusting the huge page pool inherits from its
>    parent.  This policy could come from a distant ancestor.  The
>    administrator adjusting the huge page pool without explicitly
>    specifying a mempolicy via numactl might be surprised by this.
>    Additionally, any mempolicy specified by numactl will be
>    constrained by the cpuset in which numactl is invoked.
> 
> 2) Hugepages allocated at boot time use the node_online_map.
>    An additional patch could implement a temporary boot time
>    huge pages nodes_allowed command line parameter.
> 
> 3) Using mempolicy to control persistent huge page allocation
>    and freeing requires no change to hugeadm when invoking
>    it via numactl, as shown in the examples below.  However,
>    hugeadm could be enhanced to take the allowed nodes as an
>    argument and set its task mempolicy itself.  This would allow
>    it to detect and warn about any non-default mempolicy that it
>    inherited from its parent, thus alleviating the issue described
>    in Note 1 above.
> 
> See the updated documentation [next patch] for more information
> about the implications of this patch.
> 
> Examples:
> 
> Starting with:
> 
> 	Node 0 HugePages_Total:     0
> 	Node 1 HugePages_Total:     0
> 	Node 2 HugePages_Total:     0
> 	Node 3 HugePages_Total:     0
> 
> Default behavior [with or without this patch] balances persistent
> hugepage allocation across nodes [with sufficient contiguous memory]:
> 
> 	hugeadm --pool-pages-min=2048Kb:32
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:     8
> 	Node 3 HugePages_Total:     8
> 
> Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> '--membind' because it allows multiple nodes to be specified
> and it's easy to type]--we can allocate huge pages on
> individual nodes or sets of nodes.  So, starting from the 
> condition above, with 8 huge pages per node:
> 
> 	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The incremental 8 huge pages were restricted to node 2 by the
> specified mempolicy.
> 
> Similarly, we can use mempolicy to free persistent huge pages
> from specified nodes:
> 
> 	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8
> 
> yields:
> 
> 	Node 0 HugePages_Total:     4
> 	Node 1 HugePages_Total:     4
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The 8 huge pages freed were balanced over nodes 0 and 1.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/mempolicy.h |    3 ++
>  mm/hugetlb.c              |   14 ++++++----
>  mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 73 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c	2009-08-24 12:12:53.000000000 -0400
> @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
>  	}
>  	return zl;
>  }
> +
> +/*
> + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> + *
> + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> + * to constrain the allocation and freeing of persistent huge pages.
> + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> + * page will never "fallback" to another node inside the buddy system
> + * allocator.
> + *
> + * If the task's mempolicy is "default" [NULL], just return NULL for
> + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> + * or 'interleave' policy or construct a nodemask for 'preferred' or
> + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> + *
> + * N.B., it is the caller's responsibility to free a returned nodemask.
> + */
> +nodemask_t *huge_mpol_nodes_allowed(void)
> +{
> +	nodemask_t *nodes_allowed = NULL;
> +	struct mempolicy *mempolicy;
> +	int nid;
> +
> +	if (!current->mempolicy)
> +		return NULL;
> +
> +	mpol_get(current->mempolicy);
> +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> +	if (!nodes_allowed) {
> +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> +			"for huge page allocation.\nFalling back to default.\n",
> +			current->comm);

I don't think using '\n' inside printk's is allowed anymore.
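
A sketch of one single-line alternative (message wording illustrative
only):

	printk(KERN_WARNING
		"%s: unable to allocate nodes_allowed mask for huge page allocation, using default\n",
		current->comm);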

> +		goto out;
> +	}
> +	nodes_clear(*nodes_allowed);
> +
> +	mempolicy = current->mempolicy;
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		if (mempolicy->flags & MPOL_F_LOCAL)
> +			nid = numa_node_id();
> +		else
> +			nid = mempolicy->v.preferred_node;
> +		node_set(nid, *nodes_allowed);
> +		break;
> +
> +	case MPOL_BIND:
> +		/* Fall through */
> +	case MPOL_INTERLEAVE:
> +		*nodes_allowed =  mempolicy->v.nodes;
> +		break;
> +
> +	default:
> +		BUG();
> +	}
> +
> +out:
> +	mpol_put(current->mempolicy);
> +	return nodes_allowed;
> +}

This should all be unnecessary; see below.

>  #endif
>  
>  /* Allocate a page in interleaved policy.
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h	2009-08-24 12:12:53.000000000 -0400
> @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> +extern nodemask_t *huge_mpol_nodes_allowed(void);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> +
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
>  			const nodemask_t *to_nodes, int flags)
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
>  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
>  {
>  	unsigned long min_count, ret;
> +	nodemask_t *nodes_allowed;
>  
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  

Why can't you simply do this?

	struct mempolicy *pol = NULL;
	nodemask_t *nodes_allowed = &node_online_map;

	local_irq_disable();
	pol = current->mempolicy;
	mpol_get(pol);
	local_irq_enable();
	if (pol) {
		switch (pol->mode) {
		case MPOL_BIND:
		case MPOL_INTERLEAVE:
			nodes_allowed = &pol->v.nodes;
			break;
		case MPOL_PREFERRED:
			... use NODEMASK_SCRATCH() ...
			break;
		default:
			BUG();
		}
	}
	mpol_put(pol);

and then use nodes_allowed throughout set_max_huge_pages()?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-24 19:29 ` [PATCH 4/5] hugetlb: add per node hstate attributes Lee Schermerhorn
@ 2009-08-25 10:19     ` Mel Gorman
  2009-08-25 13:35     ` Mel Gorman
  1 sibling, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-25 10:19 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> PATCH/RFC 5/4 hugetlb:  register per node hugepages attributes
> 
> Against: 2.6.31-rc6-mmotm-090820-1918
> 
> V2:  remove dependency on kobject private bitfield.  Search
>      global hstates then all per node hstates for kobject
>      match in attribute show/store functions.
> 
> V3:  rebase atop the mempolicy-based hugepage alloc/free;
>      use custom "nodes_allowed" to restrict alloc/free to
>      a specific node via per node attributes.  Per node
>      attribute overrides mempolicy.  I.e., mempolicy only
>      applies to global attributes.
> 
> To demonstrate feasibility--if not advisability--of supporting
> both mempolicy-based persistent huge page management with per
> node "override" attributes.
> 
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
> 
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> 	nr_hugepages       - r/w
> 	free_huge_pages    - r/o
> 	surplus_huge_pages - r/o
> 
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling, and the
> "nodes_allowed" constraint processing as possible.
> In set_max_huge_pages(), a node id < 0 indicates a change to
> global hstate parameters.  In this case, any non-default task
> mempolicy will be used to generate the nodes_allowed mask.  A
> node id >= 0 indicates a node specific update and the count
> argument specifies the target count for the node.  From this
> info, we compute the target global count for the hstate and
> construct a nodes_allowed node mask contain only the specified
> node.  Thus, setting the node specific nr_hugepages via the
> per node attribute effectively overrides any task mempolicy.
> 
> 
> Issue:  dependency of the base [node] driver on the hugetlbfs module.
> We want to keep all of the hstate attribute registration and handling
> in the hugetlb module.  However, we need to call into this code to
> register the per node hstate attributes on node hot plug.
> 
> With this patch:
> 
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
> 
> Starting from:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     0
> Node 2 HugePages_Free:      0
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 0
> 
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
> 
> [Note that this is equivalent to:
> 	numactl -m 2 hugeadm --pool-pages-min 2M:+16
> ]
> 
> Yields:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:    16
> Node 2 HugePages_Free:     16
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 16
> 
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     8
> Node 2 HugePages_Free:      8
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> 
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  drivers/base/node.c     |    2 
>  include/linux/hugetlb.h |    6 +
>  include/linux/node.h    |    3 
>  mm/hugetlb.c            |  213 +++++++++++++++++++++++++++++++++++++++++-------
>  4 files changed, 197 insertions(+), 27 deletions(-)
> 
> Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c	2009-08-24 12:12:56.000000000 -0400
> @@ -200,6 +200,7 @@ int register_node(struct node *node, int
>  		sysdev_create_file(&node->sysdev, &attr_distance);
>  
>  		scan_unevictable_register_node(node);
> +		hugetlb_register_node(node);
>  	}
>  	return error;
>  }
> @@ -220,6 +221,7 @@ void unregister_node(struct node *node)
>  	sysdev_remove_file(&node->sysdev, &attr_distance);
>  
>  	scan_unevictable_unregister_node(node);
> +	hugetlb_unregister_node(node);
>  
>  	sysdev_unregister(&node->sysdev);
>  }
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h	2009-08-24 12:12:56.000000000 -0400
> @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
>  	return size_to_hstate(PAGE_SIZE << compound_order(page));
>  }
>  
> +struct node;
> +extern void hugetlb_register_node(struct node *);
> +extern void hugetlb_unregister_node(struct node *);
> +
>  #else
>  struct hstate {};
>  #define alloc_bootmem_huge_page(h) NULL
> @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
>  {
>  	return 1;
>  }
> +#define hugetlb_register_node(NP)
> +#define hugetlb_unregister_node(NP)
>  #endif
>  

This also needs to be done for the !NUMA case. Try building without NUMA
set and you get the following with this patch applied:

  CC      mm/hugetlb.o
mm/hugetlb.c: In function ‘hugetlb_exit’:
mm/hugetlb.c:1629: error: implicit declaration of function ‘hugetlb_unregister_all_nodes’
mm/hugetlb.c: In function ‘hugetlb_init’:
mm/hugetlb.c:1665: error: implicit declaration of function ‘hugetlb_register_all_nodes’
make[1]: *** [mm/hugetlb.o] Error 1
make: *** [mm] Error 2
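
A hedged sketch of the missing !CONFIG_NUMA stubs that would let this
build, hung off the #else of the CONFIG_NUMA block in mm/hugetlb.c:

	#else	/* !CONFIG_NUMA */

	static void hugetlb_register_all_nodes(void) { }
	static void hugetlb_unregister_all_nodes(void) { }

	#endif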


>  #endif /* _LINUX_HUGETLB_H */
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:56.000000000 -0400
> @@ -24,6 +24,7 @@
>  #include <asm/io.h>
>  
>  #include <linux/hugetlb.h>
> +#include <linux/node.h>
>  #include "internal.h"
>  
>  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs
>  	return ret;
>  }
>  
> +static nodemask_t *nodes_allowed_from_node(int nid)
> +{

This name is a bit weird. It's creating a nodemask with just a single
node allowed.

Is there something wrong with using the existing function
nodemask_of_node()? If stack is the problem, perhaps there is some macro
magic that would allow a nodemask to be either declared on the stack or
kmalloc'd.
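
A minimal sketch of that suggestion, keeping the kmalloc() so a
MAX_NUMNODES-sized mask never lands on the stack:

	nodemask_t *nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
	if (nodes_allowed)
		*nodes_allowed = nodemask_of_node(nid);	/* only 'nid' set */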

> +	nodemask_t *nodes_allowed;
> +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> +	if (!nodes_allowed) {
> +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> +			"for huge page allocation.\nFalling back to default.\n",
> +			current->comm);
> +	} else {
> +		nodes_clear(*nodes_allowed);
> +		node_set(nid, *nodes_allowed);
> +	}
> +	return nodes_allowed;
> +}
> +
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> +								int nid)
>  {
>  	unsigned long min_count, ret;
>  	nodemask_t *nodes_allowed;
> @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> -	nodes_allowed = huge_mpol_nodes_allowed();
> +	if (nid < 0)
> +		nodes_allowed = huge_mpol_nodes_allowed();

hugetlb is a bit littered with magic numbers being passed into functions.
Attempts have been made to clean them up as patches change the surrounding
code. Would it be possible to define something like

#define HUGETLB_OBEY_MEMPOLICY -1

for the nid here as opposed to passing in -1? I know -1 is used in the page
allocator functions but there it means "current node" and here it means
"obey mempolicies".

> +	else {
> +		/*
> +		 * incoming 'count' is for node 'nid' only, so
> +		 * adjust count to global, but restrict alloc/free
> +		 * to the specified node.
> +		 */
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		nodes_allowed = nodes_allowed_from_node(nid);
> +	}
>  
>  	/*
>  	 * Increase the pool size
> @@ -1338,34 +1365,69 @@ out:
>  static struct kobject *hugepages_kobj;
>  static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
>  
> -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node *node = &node_devices[nid];
> +		int hi;
> +		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)

Does that hi mean hello, high, nid or hstate_idx?

hstate_idx would appear to be the appropriate name here.

> +			if (node->hstate_kobjs[hi] == kobj) {
> +				if (nidp)
> +					*nidp = nid;
> +				return &hstates[hi];
> +			}
> +	}

OK, so there is a struct node array for the sysdev, and this patch adds
references to the "hugepages" directory kobject and the subdirectories for
each page size. We walk all the objects until we find a match. Obviously,
this adds a dependency of base node support on hugetlbfs, which feels
backwards, and you call that out in your leader.

Can this be the other way around? i.e. The struct hstate has an array of
kobjects arranged by nid that is filled in when the node is registered?
There will only be one kobject-per-pagesize-per-node so it seems like it
would work. I confess, I haven't prototyped this to be 100% sure.
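
A rough sketch of the reversed ownership (field name hypothetical):

	struct hstate {
		...
		/* per-node "hugepages-<size>" kobjects, filled in as each
		 * node is registered */
		struct kobject *node_hstate_kobjs[MAX_NUMNODES];
	};

kobj_to_node_hstate() could then walk the small hstates[] array rather
than every node, and struct node would not need to carry any
hugetlb-specific fields.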

> +
> +	BUG();
> +	return NULL;
> +}
> +
> +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
>  {
>  	int i;
> +
>  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> -		if (hstate_kobjs[i] == kobj)
> +		if (hstate_kobjs[i] == kobj) {
> +			if (nidp)
> +				*nidp = -1;
>  			return &hstates[i];
> -	BUG();
> -	return NULL;
> +		}
> +
> +	return kobj_to_node_hstate(kobj, nidp);
>  }
>  
>  static ssize_t nr_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> +	struct hstate *h;
> +	unsigned long nr_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid < 0)
> +		nr_huge_pages = h->nr_huge_pages;

Here is another magic number except it means something slightly
different. It means NR_GLOBAL_HUGEPAGES or something similar. It would
be nice if these different special nid values could be named, preferably
collapsed to being one "core" thing.

> +	else
> +		nr_huge_pages = h->nr_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", nr_huge_pages);
>  }
> +
>  static ssize_t nr_hugepages_store(struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t count)
>  {
> -	int err;
>  	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h;
> +	int nid;
> +	int err;
>  
>  	err = strict_strtoul(buf, 10, &input);
>  	if (err)
>  		return 0;
>  
> -	h->max_huge_pages = set_max_huge_pages(h, input);

"input" is a bit meaningless. The function you are passing to calls this
parameter "count". Can you match the naming please? Otherwise, I might
guess that this is a "delta" which occurs elsewhere in the hugetlb code.

> +	h = kobj_to_hstate(kobj, &nid);
> +	h->max_huge_pages = set_max_huge_pages(h, input, nid);
>  
>  	return count;
>  }
> @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
>  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
>  }
> +
>  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t count)
>  {
>  	int err;
>  	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  
>  	err = strict_strtoul(buf, 10, &input);
>  	if (err)
> @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
>  static ssize_t free_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> +	struct hstate *h;
> +	unsigned long free_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid < 0)
> +		free_huge_pages = h->free_huge_pages;
> +	else
> +		free_huge_pages = h->free_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", free_huge_pages);
>  }
>  HSTATE_ATTR_RO(free_hugepages);
>  
>  static ssize_t resv_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
>  }
>  HSTATE_ATTR_RO(resv_hugepages);
> @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
>  static ssize_t surplus_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> +	struct hstate *h;
> +	unsigned long surplus_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid < 0)
> +		surplus_huge_pages = h->surplus_huge_pages;
> +	else
> +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", surplus_huge_pages);
>  }
>  HSTATE_ATTR_RO(surplus_hugepages);
>  
> @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
>  	.attrs = hstate_attrs,
>  };
>  
> -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> +				struct kobject *parent,
> +				struct kobject **hstate_kobjs,
> +				struct attribute_group *hstate_attr_group)
>  {
>  	int retval;
> +	int hi = h - hstates;
>  
> -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> -							hugepages_kobj);
> -	if (!hstate_kobjs[h - hstates])
> +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> +	if (!hstate_kobjs[hi])
>  		return -ENOMEM;
>  
> -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> -							&hstate_attr_group);
> +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
>  	if (retval)
> -		kobject_put(hstate_kobjs[h - hstates]);
> +		kobject_put(hstate_kobjs[hi]);
>  
>  	return retval;
>  }
> @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
>  		return;
>  
>  	for_each_hstate(h) {
> -		err = hugetlb_sysfs_add_hstate(h);
> +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> +					 hstate_kobjs, &hstate_attr_group);
>  		if (err)
>  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
>  								h->name);
>  	}
>  }
>  
> +#ifdef CONFIG_NUMA
> +static struct attribute *per_node_hstate_attrs[] = {
> +	&nr_hugepages_attr.attr,
> +	&free_hugepages_attr.attr,
> +	&surplus_hugepages_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group per_node_hstate_attr_group = {
> +	.attrs = per_node_hstate_attrs,
> +};
> +
> +
> +void hugetlb_unregister_node(struct node *node)
> +{
> +	struct hstate *h;
> +
> +	for_each_hstate(h) {
> +		kobject_put(node->hstate_kobjs[h - hstates]);
> +		node->hstate_kobjs[h - hstates] = NULL;
> +	}
> +
> +	kobject_put(node->hugepages_kobj);
> +	node->hugepages_kobj = NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		hugetlb_unregister_node(&node_devices[nid]);
> +}
> +
> +void hugetlb_register_node(struct node *node)
> +{
> +	struct hstate *h;
> +	int err;
> +
> +	if (!hugepages_kobj)
> +		return;		/* too early */
> +
> +	node->hugepages_kobj = kobject_create_and_add("hugepages",
> +							&node->sysdev.kobj);
> +	if (!node->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h) {
> +		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
> +						node->hstate_kobjs,
> +						&per_node_hstate_attr_group);
> +		if (err)
> +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> +					" for node %d\n",
> +						h->name, node->sysdev.id);
> +	}
> +}
> +
> +static void hugetlb_register_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node *node = &node_devices[nid];
> +		if (node->sysdev.id == nid && !node->hugepages_kobj)
> +			hugetlb_register_node(node);
> +	}
> +}
> +#endif
> +
>  static void __exit hugetlb_exit(void)
>  {
>  	struct hstate *h;
>  
> +	hugetlb_unregister_all_nodes();
> +
>  	for_each_hstate(h) {
>  		kobject_put(hstate_kobjs[h - hstates]);
>  	}
> @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
>  
>  	hugetlb_sysfs_init();
>  
> +	hugetlb_register_all_nodes();
> +
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
>  
>  	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> +		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
>  
>  	return 0;
>  }
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
> @@ -21,9 +21,12 @@
>  
>  #include <linux/sysdev.h>
>  #include <linux/cpumask.h>
> +#include <linux/hugetlb.h>
>  
>  struct node {
>  	struct sys_device	sysdev;
> +	struct kobject		*hugepages_kobj;
> +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
>  };
>  
>  struct memory_block;
> 

I'm not against this idea and think it can work side-by-side with the memory
policies. I believe it does need a bit more cleaning up before merging
though. I also wasn't able to test this yet due to various build and
deploy issues.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-08-24 19:27 ` [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Lee Schermerhorn
@ 2009-08-25 10:22     ` Mel Gorman
  2009-08-25 10:22     ` Mel Gorman
  1 sibling, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-25 10:22 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Mon, Aug 24, 2009 at 03:27:52PM -0400, Lee Schermerhorn wrote:
> [PATCH 3/4] hugetlb:  derive huge pages nodes allowed from task mempolicy
> 
> Against: 2.6.31-rc6-mmotm-090820-1918
> 
> V2:
> + cleaned up comments, removed some deemed unnecessary,
>   add some suggested by review
> + removed check for !current in huge_mpol_nodes_allowed().
> + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
> + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
>   catch out of range node id.
> + add examples to patch description
> 
> V3: Factored this patch from V2 patch 2/3
> 
> V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()
> 
> This patch derives a "nodes_allowed" node mask from the numa
> mempolicy of the task modifying the number of persistent huge
> pages to control the allocation, freeing and adjusting of surplus
> huge pages.  This mask is derived as follows:
> 
> * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
>   is produced.  This will cause the hugetlb subsystem to use
>   node_online_map as the "nodes_allowed".  This preserves the
>   behavior before this patch.
> * For "preferred" mempolicy, including explicit local allocation,
>   a nodemask with the single preferred node will be produced. 
>   "local" policy will NOT track any internode migrations of the
>   task adjusting nr_hugepages.
> * For "bind" and "interleave" policy, the mempolicy's nodemask
>   will be used.
> * Other than to inform the construction of the nodes_allowed node
>   mask, the actual mempolicy mode is ignored.  That is, all modes
>   behave like interleave over the resulting nodes_allowed mask
>   with no "fallback".
> 
> Notes:
> 
> 1) This patch introduces a subtle change in behavior:  huge page
>    allocation and freeing will be constrained by any mempolicy
>    that the task adjusting the huge page pool inherits from its
>    parent.  This policy could come from a distant ancestor.  The
>    administrator adjusting the huge page pool without explicitly
>    specifying a mempolicy via numactl might be surprised by this.
>    Additionally, any mempolicy specified by numactl will be
>    constrained by the cpuset in which numactl is invoked.
> 
> 2) Hugepages allocated at boot time use the node_online_map.
>    An additional patch could implement a temporary boot time
>    huge pages nodes_allowed command line parameter.
> 
> 3) Using mempolicy to control persistent huge page allocation
>    and freeing requires no change to hugeadm when invoking
>    it via numactl, as shown in the examples below.  However,
>    hugeadm could be enhanced to take the allowed nodes as an
>    argument and set its task mempolicy itself.  This would allow
>    it to detect and warn about any non-default mempolicy that it
>    inherited from its parent, thus alleviating the issue described
>    in Note 1 above.
> 
> See the updated documentation [next patch] for more information
> about the implications of this patch.
> 
> Examples:
> 
> Starting with:
> 
> 	Node 0 HugePages_Total:     0
> 	Node 1 HugePages_Total:     0
> 	Node 2 HugePages_Total:     0
> 	Node 3 HugePages_Total:     0
> 
> Default behavior [with or without this patch] balances persistent
> hugepage allocation across nodes [with sufficient contiguous memory]:
> 
> 	hugeadm --pool-pages-min=2048Kb:32
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:     8
> 	Node 3 HugePages_Total:     8
> 
> Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> '--membind' because it allows multiple nodes to be specified
> and it's easy to type]--we can allocate huge pages on
> individual nodes or sets of nodes.  So, starting from the 
> condition above, with 8 huge pages per node:
> 
> 	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The incremental 8 huge pages were restricted to node 2 by the
> specified mempolicy.
> 
> Similarly, we can use mempolicy to free persistent huge pages
> from specified nodes:
> 
> 	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8
> 
> yields:
> 
> 	Node 0 HugePages_Total:     4
> 	Node 1 HugePages_Total:     4
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The 8 huge pages freed were balanced over nodes 0 and 1.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

I haven't been able to test this yet because of some build and deploy
issues but I didn't spot anything wrong when eyeballing the patch. For
the moment;

Acked-by: Mel Gorman <mel@csn.ul.ie>

> 
>  include/linux/mempolicy.h |    3 ++
>  mm/hugetlb.c              |   14 ++++++----
>  mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 73 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c	2009-08-24 12:12:53.000000000 -0400
> @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
>  	}
>  	return zl;
>  }
> +
> +/*
> + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> + *
> + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> + * to constrain the allocation and freeing of persistent huge pages.
> + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> + * page will never "fallback" to another node inside the buddy system
> + * allocator.
> + *
> + * If the task's mempolicy is "default" [NULL], just return NULL for
> + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> + * or 'interleave' policy or construct a nodemask for 'preferred' or
> + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> + *
> + * N.B., it is the caller's responsibility to free a returned nodemask.
> + */
> +nodemask_t *huge_mpol_nodes_allowed(void)
> +{
> +	nodemask_t *nodes_allowed = NULL;
> +	struct mempolicy *mempolicy;
> +	int nid;
> +
> +	if (!current->mempolicy)
> +		return NULL;
> +
> +	mpol_get(current->mempolicy);
> +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> +	if (!nodes_allowed) {
> +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> +			"for huge page allocation.\nFalling back to default.\n",
> +			current->comm);
> +		goto out;
> +	}
> +	nodes_clear(*nodes_allowed);
> +
> +	mempolicy = current->mempolicy;
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		if (mempolicy->flags & MPOL_F_LOCAL)
> +			nid = numa_node_id();
> +		else
> +			nid = mempolicy->v.preferred_node;
> +		node_set(nid, *nodes_allowed);
> +		break;
> +
> +	case MPOL_BIND:
> +		/* Fall through */
> +	case MPOL_INTERLEAVE:
> +		*nodes_allowed =  mempolicy->v.nodes;
> +		break;
> +
> +	default:
> +		BUG();
> +	}
> +
> +out:
> +	mpol_put(current->mempolicy);
> +	return nodes_allowed;
> +}
>  #endif
>  
>  /* Allocate a page in interleaved policy.
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h	2009-08-24 12:12:53.000000000 -0400
> @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> +extern nodemask_t *huge_mpol_nodes_allowed(void);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> +
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
>  			const nodemask_t *to_nodes, int flags)
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
>  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
>  {
>  	unsigned long min_count, ret;
> +	nodemask_t *nodes_allowed;
>  
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> +	nodes_allowed = huge_mpol_nodes_allowed();
> +
>  	/*
>  	 * Increase the pool size
>  	 * First take pages out of surplus state.  Then make up the
> @@ -1274,7 +1277,7 @@ static unsigned long set_max_huge_pages(
>  	 */
>  	spin_lock(&hugetlb_lock);
>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> -		if (!adjust_pool_surplus(h, NULL, -1))
> +		if (!adjust_pool_surplus(h, nodes_allowed, -1))
>  			break;
>  	}
>  
> @@ -1285,7 +1288,7 @@ static unsigned long set_max_huge_pages(
>  		 * and reducing the surplus.
>  		 */
>  		spin_unlock(&hugetlb_lock);
> -		ret = alloc_fresh_huge_page(h, NULL);
> +		ret = alloc_fresh_huge_page(h, nodes_allowed);
>  		spin_lock(&hugetlb_lock);
>  		if (!ret)
>  			goto out;
> @@ -1309,18 +1312,19 @@ static unsigned long set_max_huge_pages(
>  	 */
>  	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>  	min_count = max(count, min_count);
> -	try_to_free_low(h, min_count, NULL);
> +	try_to_free_low(h, min_count, nodes_allowed);
>  	while (min_count < persistent_huge_pages(h)) {
> -		if (!free_pool_huge_page(h, NULL, 0))
> +		if (!free_pool_huge_page(h, nodes_allowed, 0))
>  			break;
>  	}
>  	while (count < persistent_huge_pages(h)) {
> -		if (!adjust_pool_surplus(h, NULL, 1))
> +		if (!adjust_pool_surplus(h, nodes_allowed, 1))
>  			break;
>  	}
>  out:
>  	ret = persistent_huge_pages(h);
>  	spin_unlock(&hugetlb_lock);
> +	kfree(nodes_allowed);
>  	return ret;
>  }
>  
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-24 19:29 ` [PATCH 4/5] hugetlb: add per node hstate attributes Lee Schermerhorn
@ 2009-08-25 13:35     ` Mel Gorman
  2009-08-25 13:35     ` Mel Gorman
  1 sibling, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-25 13:35 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> <SNIP>
>
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
> @@ -21,9 +21,12 @@
>  
>  #include <linux/sysdev.h>
>  #include <linux/cpumask.h>
> +#include <linux/hugetlb.h>
>  

Is this header inclusion necessary? It does not appear to be required by
the structure modification (which is iffy in itself as discussed in the
earlier mail) and it breaks build on x86-64.

 CC      arch/x86/kernel/setup_percpu.o
In file included from include/linux/pagemap.h:10,
                 from include/linux/mempolicy.h:62,
                 from include/linux/hugetlb.h:8,
                 from include/linux/node.h:24,
                 from include/linux/cpu.h:23,
                 from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5,
                 from arch/x86/kernel/setup_percpu.c:19:
include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration
/usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here
include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration
/usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here
include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration
/usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here
make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1
make[1]: *** [arch/x86/kernel] Error 2
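
One possible direction, sketched here only as an illustration and not
as the posted fix: keep the per-node kobject bookkeeping private to
mm/hugetlb.c, so that include/linux/node.h never needs linux/hugetlb.h
at all:

	/* in mm/hugetlb.c; sizing by MAX_NUMNODES is an assumption */
	struct node_hstate {
		struct kobject		*hugepages_kobj;
		struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
	};
	static struct node_hstate node_hstates[MAX_NUMNODES];

struct node would then stay untouched, at the cost of a static
MAX_NUMNODES-sized array in hugetlb.c.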



>  struct node {
>  	struct sys_device	sysdev;
> +	struct kobject		*hugepages_kobj;
> +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
>  };
>  
>  struct memory_block;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-08-25  8:47     ` David Rientjes
@ 2009-08-25 20:49       ` Lee Schermerhorn
  -1 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-25 20:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-08-25 at 01:47 -0700, David Rientjes wrote:
> On Mon, 24 Aug 2009, Lee Schermerhorn wrote:
> 
> > This patch derives a "nodes_allowed" node mask from the numa
> > mempolicy of the task modifying the number of persistent huge
> > pages to control the allocation, freeing and adjusting of surplus
> > huge pages.  This mask is derived as follows:
> > 
> > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
> >   is produced.  This will cause the hugetlb subsystem to use
> >   node_online_map as the "nodes_allowed".  This preserves the
> >   behavior before this patch.
> > * For "preferred" mempolicy, including explicit local allocation,
> >   a nodemask with the single preferred node will be produced. 
> >   "local" policy will NOT track any internode migrations of the
> >   task adjusting nr_hugepages.
> > * For "bind" and "interleave" policy, the mempolicy's nodemask
> >   will be used.
> > * Other than to inform the construction of the nodes_allowed node
> >   mask, the actual mempolicy mode is ignored.  That is, all modes
> >   behave like interleave over the resulting nodes_allowed mask
> >   with no "fallback".
> > 
> > Notes:
> > 
> > 1) This patch introduces a subtle change in behavior:  huge page
> >    allocation and freeing will be constrained by any mempolicy
> >    that the task adjusting the huge page pool inherits from its
> >    parent.  This policy could come from a distant ancestor.  The
> >    administrator adjusting the huge page pool without explicitly
> >    specifying a mempolicy via numactl might be surprised by this.
> >    Additionally, any mempolicy specified by numactl will be
> >    constrained by the cpuset in which numactl is invoked.
> > 
> > 2) Hugepages allocated at boot time use the node_online_map.
> >    An additional patch could implement a temporary boot time
> >    huge pages nodes_allowed command line parameter.
> > 
> > 3) Using mempolicy to control persistent huge page allocation
> >    and freeing requires no change to hugeadm when invoking
> >    it via numactl, as shown in the examples below.  However,
> >    hugeadm could be enhanced to take the allowed nodes as an
> >    argument and set its task mempolicy itself.  This would allow
> >    it to detect and warn about any non-default mempolicy that it
> >    inherited from its parent, thus alleviating the issue described
> >    in Note 1 above.
> > 
> > See the updated documentation [next patch] for more information
> > about the implications of this patch.
> > 
> > Examples:
> > 
> > Starting with:
> > 
> > 	Node 0 HugePages_Total:     0
> > 	Node 1 HugePages_Total:     0
> > 	Node 2 HugePages_Total:     0
> > 	Node 3 HugePages_Total:     0
> > 
> > Default behavior [with or without this patch] balances persistent
> > hugepage allocation across nodes [with sufficient contiguous memory]:
> > 
> > 	hugeadm --pool-pages-min=2048Kb:32
> > 
> > yields:
> > 
> > 	Node 0 HugePages_Total:     8
> > 	Node 1 HugePages_Total:     8
> > 	Node 2 HugePages_Total:     8
> > 	Node 3 HugePages_Total:     8
> > 
> > Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> > '--membind' because it allows multiple nodes to be specified
> > and it's easy to type]--we can allocate huge pages on
> > individual nodes or sets of nodes.  So, starting from the 
> > condition above, with 8 huge pages per node:
> > 
> > 	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
> > 
> > yields:
> > 
> > 	Node 0 HugePages_Total:     8
> > 	Node 1 HugePages_Total:     8
> > 	Node 2 HugePages_Total:    16
> > 	Node 3 HugePages_Total:     8
> > 
> > The incremental 8 huge pages were restricted to node 2 by the
> > specified mempolicy.
> > 
> > Similarly, we can use mempolicy to free persistent huge pages
> > from specified nodes:
> > 
> > 	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8
> > 
> > yields:
> > 
> > 	Node 0 HugePages_Total:     4
> > 	Node 1 HugePages_Total:     4
> > 	Node 2 HugePages_Total:    16
> > 	Node 3 HugePages_Total:     8
> > 
> > The 8 huge pages freed were balanced over nodes 0 and 1.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/mempolicy.h |    3 ++
> >  mm/hugetlb.c              |   14 ++++++----
> >  mm/mempolicy.c            |   61 ++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 73 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c	2009-08-24 12:12:44.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c	2009-08-24 12:12:53.000000000 -0400
> > @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
> >  	}
> >  	return zl;
> >  }
> > +
> > +/*
> > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> > + *
> > + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> > + * to constrain the allocation and freeing of persistent huge pages.
> > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> > + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> > + * page will never "fallback" to another node inside the buddy system
> > + * allocator.
> > + *
> > + * If the task's mempolicy is "default" [NULL], just return NULL for
> > + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> > + * or 'interleave' policy or construct a nodemask for 'preferred' or
> > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> > + *
> > + * N.B., it is the caller's responsibility to free a returned nodemask.
> > + */
> > +nodemask_t *huge_mpol_nodes_allowed(void)
> > +{
> > +	nodemask_t *nodes_allowed = NULL;
> > +	struct mempolicy *mempolicy;
> > +	int nid;
> > +
> > +	if (!current->mempolicy)
> > +		return NULL;
> > +
> > +	mpol_get(current->mempolicy);
> > +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> > +	if (!nodes_allowed) {
> > +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> > +			"for huge page allocation.\nFalling back to default.\n",
> > +			current->comm);
> 
> I don't think using '\n' inside printk's is allowed anymore.

OK, will remove.
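
Presumably something like this [sketch of the fixed message; exact
wording TBD]:

	printk(KERN_WARNING
		"%s: unable to allocate nodes allowed mask for huge page "
		"allocation, falling back to default\n",
		current->comm);
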
> 
> > +		goto out;
> > +	}
> > +	nodes_clear(*nodes_allowed);
> > +
> > +	mempolicy = current->mempolicy;
> > +	switch (mempolicy->mode) {
> > +	case MPOL_PREFERRED:
> > +		if (mempolicy->flags & MPOL_F_LOCAL)
> > +			nid = numa_node_id();
> > +		else
> > +			nid = mempolicy->v.preferred_node;
> > +		node_set(nid, *nodes_allowed);
> > +		break;
> > +
> > +	case MPOL_BIND:
> > +		/* Fall through */
> > +	case MPOL_INTERLEAVE:
> > +		*nodes_allowed =  mempolicy->v.nodes;
> > +		break;
> > +
> > +	default:
> > +		BUG();
> > +	}
> > +
> > +out:
> > +	mpol_put(current->mempolicy);
> > +	return nodes_allowed;
> > +}
> 
> This should be all unnecessary, see below.
> 
> >  #endif
> >  
> >  /* Allocate a page in interleaved policy.
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h	2009-08-24 12:12:44.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h	2009-08-24 12:12:53.000000000 -0400
> > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
> >  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> >  				unsigned long addr, gfp_t gfp_flags,
> >  				struct mempolicy **mpol, nodemask_t **nodemask);
> > +extern nodemask_t *huge_mpol_nodes_allowed(void);
> >  extern unsigned slab_node(struct mempolicy *policy);
> >  
> >  extern enum zone_type policy_zone;
> > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
> >  	return node_zonelist(0, gfp_flags);
> >  }
> >  
> > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> > +
> >  static inline int do_migrate_pages(struct mm_struct *mm,
> >  			const nodemask_t *from_nodes,
> >  			const nodemask_t *to_nodes, int flags)
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
> >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> >  {
> >  	unsigned long min_count, ret;
> > +	nodemask_t *nodes_allowed;
> >  
> >  	if (h->order >= MAX_ORDER)
> >  		return h->max_huge_pages;
> >  
> 
> Why can't you simply do this?
> 
> 	struct mempolicy *pol = NULL;
> 	nodemask_t *nodes_allowed = &node_online_map;
> 
> 	local_irq_disable();
> 	pol = current->mempolicy;
> 	mpol_get(pol);
> 	local_irq_enable();
> 	if (pol) {
> 		switch (pol->mode) {
> 		case MPOL_BIND:
> 		case MPOL_INTERLEAVE:
> 			nodes_allowed = pol->v.nodes;
> 			break;
> 		case MPOL_PREFERRED:
> 			... use NODEMASK_SCRATCH() ...
> 		default:
> 			BUG();
> 		}
> 	}
> 	mpol_put(pol);
> 
> and then use nodes_allowed throughout set_max_huge_pages()?


Well, I do use nodes_allowed [pointer] throughout set_max_huge_pages().
NODEMASK_SCRATCH() didn't exist when I wrote this, and I can't be sure
it will return a kmalloc()'d nodemask, which I need because a NULL
nodemask pointer means "all online nodes" [really all nodes with memory,
I suppose] and I need a pointer to kmalloc()'d nodemask to return from
huge_mpol_nodes_allowed().  I want to keep the access to the internals
of mempolicy in mempolicy.[ch], thus the call out to
huge_mpol_nodes_allowed(), instead of open coding it.  It's not really a
hot path, so I didn't want to fuss with a static inline in the header,
even tho' this is the only call site.
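
To make the convention concrete, the consumer side amounts to something
like this [sketch only, not the actual set_max_huge_pages() code]:

	nodemask_t *nodes_allowed = huge_mpol_nodes_allowed();

	/* NULL means no non-default mempolicy: use all online nodes */
	if (!nodes_allowed)
		nodes_allowed = &node_online_map;

	/* ... alloc/free constrained to *nodes_allowed ... */

	if (nodes_allowed != &node_online_map)
		kfree(nodes_allowed);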

Lee



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-25 10:19     ` Mel Gorman
@ 2009-08-25 20:49       ` Lee Schermerhorn
  0 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-25 20:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-08-25 at 11:19 +0100, Mel Gorman wrote:
> On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> > PATCH/RFC 5/4 hugetlb:  register per node hugepages attributes
> > 
> > Against: 2.6.31-rc6-mmotm-090820-1918
> > 
> > V2:  remove dependency on kobject private bitfield.  Search
> >      global hstates then all per node hstates for kobject
> >      match in attribute show/store functions.
> > 
> > V3:  rebase atop the mempolicy-based hugepage alloc/free;
> >      use custom "nodes_allowed" to restrict alloc/free to
> >      a specific node via per node attributes.  Per node
> >      attribute overrides mempolicy.  I.e., mempolicy only
> >      applies to global attributes.
> > 
> > To demonstrate feasibility--if not advisability--of supporting
> > both mempolicy-based persistent huge page management and per
> > node "override" attributes.
> > 
> > This patch adds the per huge page size control/query attributes
> > to the per node sysdevs:
> > 
> > /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> > 	nr_hugepages       - r/w
> > 	free_huge_pages    - r/o
> > 	surplus_huge_pages - r/o
> > 
> > The patch attempts to re-use/share as much of the existing
> > global hstate attribute initialization and handling, and the
> > "nodes_allowed" constraint processing as possible.
> > In set_max_huge_pages(), a node id < 0 indicates a change to
> > global hstate parameters.  In this case, any non-default task
> > mempolicy will be used to generate the nodes_allowed mask.  A
> > node id >= 0 indicates a node specific update and the count
> > argument specifies the target count for the node.  From this
> > info, we compute the target global count for the hstate and
> > construct a nodes_allowed node mask containing only the specified
> > node.  Thus, setting the node specific nr_hugepages via the
> > per node attribute effectively overrides any task mempolicy.
> > 
> > 
> > Issue:  dependency of the base [node] driver on the hugetlbfs module.
> > We want to keep all of the hstate attribute registration and handling
> > in the hugetlb module.  However, we need to call into this code to
> > register the per node hstate attributes on node hot plug.
> > 
> > With this patch:
> > 
> > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> > ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
> > 
> > Starting from:
> > Node 0 HugePages_Total:     0
> > Node 0 HugePages_Free:      0
> > Node 0 HugePages_Surp:      0
> > Node 1 HugePages_Total:     0
> > Node 1 HugePages_Free:      0
> > Node 1 HugePages_Surp:      0
> > Node 2 HugePages_Total:     0
> > Node 2 HugePages_Free:      0
> > Node 2 HugePages_Surp:      0
> > Node 3 HugePages_Total:     0
> > Node 3 HugePages_Free:      0
> > Node 3 HugePages_Surp:      0
> > vm.nr_hugepages = 0
> > 
> > Allocate 16 persistent huge pages on node 2:
> > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
> > 
> > [Note that this is equivalent to:
> > 	numactl -m 2 hugeadm --pool-pages-min 2M:+16
> > ]
> > 
> > Yields:
> > Node 0 HugePages_Total:     0
> > Node 0 HugePages_Free:      0
> > Node 0 HugePages_Surp:      0
> > Node 1 HugePages_Total:     0
> > Node 1 HugePages_Free:      0
> > Node 1 HugePages_Surp:      0
> > Node 2 HugePages_Total:    16
> > Node 2 HugePages_Free:     16
> > Node 2 HugePages_Surp:      0
> > Node 3 HugePages_Total:     0
> > Node 3 HugePages_Free:      0
> > Node 3 HugePages_Surp:      0
> > vm.nr_hugepages = 16
> > 
> > Global controls work as expected--reduce pool to 8 persistent huge pages:
> > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > 
> > Node 0 HugePages_Total:     0
> > Node 0 HugePages_Free:      0
> > Node 0 HugePages_Surp:      0
> > Node 1 HugePages_Total:     0
> > Node 1 HugePages_Free:      0
> > Node 1 HugePages_Surp:      0
> > Node 2 HugePages_Total:     8
> > Node 2 HugePages_Free:      8
> > Node 2 HugePages_Surp:      0
> > Node 3 HugePages_Total:     0
> > Node 3 HugePages_Free:      0
> > Node 3 HugePages_Surp:      0
> > 
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  drivers/base/node.c     |    2 
> >  include/linux/hugetlb.h |    6 +
> >  include/linux/node.h    |    3 
> >  mm/hugetlb.c            |  213 +++++++++++++++++++++++++++++++++++++++++-------
> >  4 files changed, 197 insertions(+), 27 deletions(-)
> > 
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c	2009-08-24 12:12:44.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c	2009-08-24 12:12:56.000000000 -0400
> > @@ -200,6 +200,7 @@ int register_node(struct node *node, int
> >  		sysdev_create_file(&node->sysdev, &attr_distance);
> >  
> >  		scan_unevictable_register_node(node);
> > +		hugetlb_register_node(node);
> >  	}
> >  	return error;
> >  }
> > @@ -220,6 +221,7 @@ void unregister_node(struct node *node)
> >  	sysdev_remove_file(&node->sysdev, &attr_distance);
> >  
> >  	scan_unevictable_unregister_node(node);
> > +	hugetlb_unregister_node(node);
> >  
> >  	sysdev_unregister(&node->sysdev);
> >  }
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h	2009-08-24 12:12:44.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h	2009-08-24 12:12:56.000000000 -0400
> > @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
> >  	return size_to_hstate(PAGE_SIZE << compound_order(page));
> >  }
> >  
> > +struct node;
> > +extern void hugetlb_register_node(struct node *);
> > +extern void hugetlb_unregister_node(struct node *);
> > +
> >  #else
> >  struct hstate {};
> >  #define alloc_bootmem_huge_page(h) NULL
> > @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
> >  {
> >  	return 1;
> >  }
> > +#define hugetlb_register_node(NP)
> > +#define hugetlb_unregister_node(NP)
> >  #endif
> >  
> 
> This also needs to be done for the !NUMA case. Try building without NUMA
> set and you get the following with this patch applied
> 
>   CC      mm/hugetlb.o
> mm/hugetlb.c: In function ‘hugetlb_exit’:
> mm/hugetlb.c:1629: error: implicit declaration of function ‘hugetlb_unregister_all_nodes’
> mm/hugetlb.c: In function ‘hugetlb_init’:
> mm/hugetlb.c:1665: error: implicit declaration of function ‘hugetlb_register_all_nodes’
> make[1]: *** [mm/hugetlb.o] Error 1
> make: *** [mm] Error 2

Ouch!  Sorry.  Will add stubs.
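
For the !NUMA build, something like this in an #else branch of the
existing CONFIG_NUMA block in mm/hugetlb.c [sketch]:

	#else	/* !CONFIG_NUMA */

	static void hugetlb_register_all_nodes(void) { }
	static void hugetlb_unregister_all_nodes(void) { }

	#endif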

> 
> 
> >  #endif /* _LINUX_HUGETLB_H */
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:56.000000000 -0400
> > @@ -24,6 +24,7 @@
> >  #include <asm/io.h>
> >  
> >  #include <linux/hugetlb.h>
> > +#include <linux/node.h>
> >  #include "internal.h"
> >  
> >  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> > @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs
> >  	return ret;
> >  }
> >  
> > +static nodemask_t *nodes_allowed_from_node(int nid)
> > +{
> 
> This name is a bit weird. It's creating a nodemask with just a single
> node allowed.
> 
> Is there something wrong with using the existing function
> nodemask_of_node()? If stack is the problem, prehaps there is some macro
> magic that would allow a nodemask to be either declared on the stack or
> kmalloc'd.

Yeah.  nodemask_of_node() creates an on-stack mask, invisibly, in a
block nested inside the context where it's invoked.  I would be
declaring the nodemask in the compound else clause and don't want to
access it [via the nodes_allowed pointer] from outside of there.
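
Spelled out, the hazard looks like this [illustrative sketch, not
proposed code]:

	nodemask_t *nodes_allowed;

	if (nid < 0)
		nodes_allowed = huge_mpol_nodes_allowed();
	else {
		nodemask_t mask = nodemask_of_node(nid);	/* on-stack */
		nodes_allowed = &mask;
	}
	/*
	 * 'mask' is out of scope here, so dereferencing nodes_allowed
	 * in the nid >= 0 case would be a dangling-pointer bug.
	 */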

> 
> > +	nodemask_t *nodes_allowed;
> > +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> > +	if (!nodes_allowed) {
> > +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> > +			"for huge page allocation.\nFalling back to default.\n",
> > +			current->comm);
> > +	} else {
> > +		nodes_clear(*nodes_allowed);
> > +		node_set(nid, *nodes_allowed);
> > +	}
> > +	return nodes_allowed;
> > +}
> > +
> >  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > +								int nid)
> >  {
> >  	unsigned long min_count, ret;
> >  	nodemask_t *nodes_allowed;
> > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
> >  	if (h->order >= MAX_ORDER)
> >  		return h->max_huge_pages;
> >  
> > -	nodes_allowed = huge_mpol_nodes_allowed();
> > +	if (nid < 0)
> > +		nodes_allowed = huge_mpol_nodes_allowed();
> 
> hugetlb is a bit littered with magic numbers been passed into functions.
> Attempts have been made to clear them up as according as patches change
> that area. Would it be possible to define something like
> 
> #define HUGETLB_OBEY_MEMPOLICY -1
> 
> for the nid here as opposed to passing in -1? I know -1 is used in the page
> allocator functions but there it means "current node" and here it means
> "obey mempolicies".

Well, here it means NO_NODE_ID_SPECIFIED or "we didn't get here via a
per node attribute".  It means "derive nodes allowed from memory policy,
if non-default, else use node_online_map" [which is not exactly the
same as obeying memory policy].

But, I can see defining a symbolic constant such as
NO_NODE[_ID_SPECIFIED].  I'll try next spin.
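
Maybe just [name not final]:

	/* no node id specified via a per node attribute */
	#define NO_NODEID_SPECIFIED	(-1)

so the sysctl handler would call
set_max_huge_pages(h, tmp, NO_NODEID_SPECIFIED) instead of passing a
bare -1.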

> 
> > +	else {
> > +		/*
> > +		 * incoming 'count' is for node 'nid' only, so
> > +		 * adjust count to global, but restrict alloc/free
> > +		 * to the specified node.
> > +		 */
> > +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> > +		nodes_allowed = nodes_allowed_from_node(nid);
> > +	}
> >  
> >  	/*
> >  	 * Increase the pool size
> > @@ -1338,34 +1365,69 @@ out:
> >  static struct kobject *hugepages_kobj;
> >  static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
> >  
> > -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		struct node *node = &node_devices[nid];
> > +		int hi;
> > +		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
> 
> Does that hi mean hello, high, nid or hstate_idx?
> 
> hstate_idx would appear to be the appropriate name here.

Or just plain 'i', like in the following, pre-existing function?

> 
> > +			if (node->hstate_kobjs[hi] == kobj) {
> > +				if (nidp)
> > +					*nidp = nid;
> > +				return &hstates[hi];
> > +			}
> > +	}
> 
> Ok.... so, there is a struct node array for the sysdev and this patch adds
> references to the "hugepages" directory kobject and the subdirectories for
> each page size. We walk all the objects until we find a match. Obviously,
> this adds a dependency of base node support on hugetlbfs which feels backwards
> and you call that out in your leader.
> 
> Can this be the other way around? i.e. The struct hstate has an array of
> kobjects arranged by nid that is filled in when the node is registered?
> There will only be one kobject-per-pagesize-per-node so it seems like it
> would work. I confess, I haven't prototyped this to be 100% sure.

This will take a bit longer to sort out.  I do want to change the
registration, tho', so that hugetlb.c registers its single node
register/unregister functions with base/node.c, removing the source
level dependency in that direction.  node.c would then only register
nodes on hot plug, as it is initialized too early, relative to
hugetlb.c, to register them at init time.  This should break the call
dependency of base/node.c on the hugetlb module.
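
Roughly [interface sketch only; names and signatures not final]:

	/* in include/linux/node.h -- no hugetlb types required */
	struct node_hstate_ops {
		void (*register_node)(struct node *node);
		void (*unregister_node)(struct node *node);
	};
	extern void register_node_hstate_ops(struct node_hstate_ops *ops);

hugetlb_init() would register the ops, and base/node.c would invoke
them on node hot plug instead of calling into hugetlb directly.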

As far as moving the per node attributes' kobjects to the hugetlb global
hstate arrays...  Have to think about that.  I agree that it would be
nice to remove the source level [header] dependency.

> 
> > +
> > +	BUG();
> > +	return NULL;
> > +}
> > +
> > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
> >  {
> >  	int i;
> > +
> >  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > -		if (hstate_kobjs[i] == kobj)
> > +		if (hstate_kobjs[i] == kobj) {
> > +			if (nidp)
> > +				*nidp = -1;
> >  			return &hstates[i];
> > -	BUG();
> > -	return NULL;
> > +		}
> > +
> > +	return kobj_to_node_hstate(kobj, nidp);
> >  }
> >  
> >  static ssize_t nr_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> > +	struct hstate *h;
> > +	unsigned long nr_huge_pages;
> > +	int nid;
> > +
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	if (nid < 0)
> > +		nr_huge_pages = h->nr_huge_pages;
> 
> Here is another magic number except it means something slightly
> different. It means NR_GLOBAL_HUGEPAGES or something similar. It would
> be nice if these different special nid values could be named, preferably
> collapsed to being one "core" thing.

Again, it means "NO NODE ID specified" [via per node attribute].  Again,
I'll address this with a single constant.

> 
> > +	else
> > +		nr_huge_pages = h->nr_huge_pages_node[nid];
> > +
> > +	return sprintf(buf, "%lu\n", nr_huge_pages);
> >  }
> > +
> >  static ssize_t nr_hugepages_store(struct kobject *kobj,
> >  		struct kobj_attribute *attr, const char *buf, size_t count)
> >  {
> > -	int err;
> >  	unsigned long input;
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h;
> > +	int nid;
> > +	int err;
> >  
> >  	err = strict_strtoul(buf, 10, &input);
> >  	if (err)
> >  		return 0;
> >  
> > -	h->max_huge_pages = set_max_huge_pages(h, input);
> 
> "input" is a bit meaningless. The function you are passing to calls this
> parameter "count". Can you match the naming please? Otherwise, I might
> guess that this is a "delta" which occurs elsewhere in the hugetlb code.

I guess I can change that.  It's the pre-existing name, and 'count' was
already used.  Guess I can change 'count' to 'len' and 'input' to
'count'.
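
I.e. [sketch of the rename only]:

	static ssize_t nr_hugepages_store(struct kobject *kobj,
			struct kobj_attribute *attr, const char *buf, size_t len)
	{
		unsigned long count;
		struct hstate *h;
		int nid;

		if (strict_strtoul(buf, 10, &count))
			return 0;

		h = kobj_to_hstate(kobj, &nid);
		h->max_huge_pages = set_max_huge_pages(h, count, nid);
		return len;
	}
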
> 
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	h->max_huge_pages = set_max_huge_pages(h, input, nid);
> >  
> >  	return count;
> >  }
> > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
> >  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> > +
> >  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
> >  }
> > +
> >  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
> >  		struct kobj_attribute *attr, const char *buf, size_t count)
> >  {
> >  	int err;
> >  	unsigned long input;
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> >  
> >  	err = strict_strtoul(buf, 10, &input);
> >  	if (err)
> > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
> >  static ssize_t free_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> > +	struct hstate *h;
> > +	unsigned long free_huge_pages;
> > +	int nid;
> > +
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	if (nid < 0)
> > +		free_huge_pages = h->free_huge_pages;
> > +	else
> > +		free_huge_pages = h->free_huge_pages_node[nid];
> > +
> > +	return sprintf(buf, "%lu\n", free_huge_pages);
> >  }
> >  HSTATE_ATTR_RO(free_hugepages);
> >  
> >  static ssize_t resv_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> >  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
> >  }
> >  HSTATE_ATTR_RO(resv_hugepages);
> > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
> >  static ssize_t surplus_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> > -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> > +	struct hstate *h;
> > +	unsigned long surplus_huge_pages;
> > +	int nid;
> > +
> > +	h = kobj_to_hstate(kobj, &nid);
> > +	if (nid < 0)
> > +		surplus_huge_pages = h->surplus_huge_pages;
> > +	else
> > +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> > +
> > +	return sprintf(buf, "%lu\n", surplus_huge_pages);
> >  }
> >  HSTATE_ATTR_RO(surplus_hugepages);
> >  
> > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
> >  	.attrs = hstate_attrs,
> >  };
> >  
> > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> > +				struct kobject *parent,
> > +				struct kobject **hstate_kobjs,
> > +				struct attribute_group *hstate_attr_group)
> >  {
> >  	int retval;
> > +	int hi = h - hstates;
> >  
> > -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> > -							hugepages_kobj);
> > -	if (!hstate_kobjs[h - hstates])
> > +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> > +	if (!hstate_kobjs[hi])
> >  		return -ENOMEM;
> >  
> > -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> > -							&hstate_attr_group);
> > +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
> >  	if (retval)
> > -		kobject_put(hstate_kobjs[h - hstates]);
> > +		kobject_put(hstate_kobjs[hi]);
> >  
> >  	return retval;
> >  }
> > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
> >  		return;
> >  
> >  	for_each_hstate(h) {
> > -		err = hugetlb_sysfs_add_hstate(h);
> > +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> > +					 hstate_kobjs, &hstate_attr_group);
> >  		if (err)
> >  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> >  								h->name);
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_NUMA
> > +static struct attribute *per_node_hstate_attrs[] = {
> > +	&nr_hugepages_attr.attr,
> > +	&free_hugepages_attr.attr,
> > +	&surplus_hugepages_attr.attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group per_node_hstate_attr_group = {
> > +	.attrs = per_node_hstate_attrs,
> > +};
> > +
> > +
> > +void hugetlb_unregister_node(struct node *node)
> > +{
> > +	struct hstate *h;
> > +
> > +	for_each_hstate(h) {
> > +		kobject_put(node->hstate_kobjs[h - hstates]);
> > +		node->hstate_kobjs[h - hstates] = NULL;
> > +	}
> > +
> > +	kobject_put(node->hugepages_kobj);
> > +	node->hugepages_kobj = NULL;
> > +}
> > +
> > +static void hugetlb_unregister_all_nodes(void)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++)
> > +		hugetlb_unregister_node(&node_devices[nid]);
> > +}
> > +
> > +void hugetlb_register_node(struct node *node)
> > +{
> > +	struct hstate *h;
> > +	int err;
> > +
> > +	if (!hugepages_kobj)
> > +		return;		/* too early */
> > +
> > +	node->hugepages_kobj = kobject_create_and_add("hugepages",
> > +							&node->sysdev.kobj);
> > +	if (!node->hugepages_kobj)
> > +		return;
> > +
> > +	for_each_hstate(h) {
> > +		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
> > +						node->hstate_kobjs,
> > +						&per_node_hstate_attr_group);
> > +		if (err)
> > +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> > +					" for node %d\n",
> > +						h->name, node->sysdev.id);
> > +	}
> > +}
> > +
> > +static void hugetlb_register_all_nodes(void)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		struct node *node = &node_devices[nid];
> > +		if (node->sysdev.id == nid && !node->hugepages_kobj)
> > +			hugetlb_register_node(node);
> > +	}
> > +}
> > +#endif
> > +
> >  static void __exit hugetlb_exit(void)
> >  {
> >  	struct hstate *h;
> >  
> > +	hugetlb_unregister_all_nodes();
> > +
> >  	for_each_hstate(h) {
> >  		kobject_put(hstate_kobjs[h - hstates]);
> >  	}
> > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
> >  
> >  	hugetlb_sysfs_init();
> >  
> > +	hugetlb_register_all_nodes();
> > +
> >  	return 0;
> >  }
> >  module_init(hugetlb_init);
> > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> >  	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> >  
> >  	if (write)
> > -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> > +		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
> >  
> >  	return 0;
> >  }
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
> > @@ -21,9 +21,12 @@
> >  
> >  #include <linux/sysdev.h>
> >  #include <linux/cpumask.h>
> > +#include <linux/hugetlb.h>
> >  
> >  struct node {
> >  	struct sys_device	sysdev;
> > +	struct kobject		*hugepages_kobj;
> > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> >  };
> >  
> >  struct memory_block;
> > 
> 
> I'm not against this idea and think it can work side-by-side with the memory
> policies. I believe it does need a bit more cleaning up before merging
> though. I also wasn't able to test this yet due to various build and
> deploy issues.

OK.  I'll do the cleanup.   I have tested this atop the mempolicy
version by working around the build issues that I thought were just
temporary glitches in the mmotm series.  In my [limited] experience, one
can interleave numactl+hugeadm with setting values via the per node
attributes and it does the right thing.  No heavy testing with racing
tasks, tho'.
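
For example, a mixed sequence like [assuming the 2048kB hstate and the
four node layout from the examples above]:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
	echo 4 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

ends up with node 2 holding 4 pages, since the per node store overrides
any task mempolicy.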

Lee


^ permalink raw reply	[flat|nested] 51+ messages in thread

> > +#include <linux/hugetlb.h>
> >  
> >  struct node {
> >  	struct sys_device	sysdev;
> > +	struct kobject		*hugepages_kobj;
> > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> >  };
> >  
> >  struct memory_block;
> > 
> 
> I'm not against this idea and think it can work side-by-side with the memory
> policies. I believe it does need a bit more cleaning up before merging
> though. I also wasn't able to test this yet due to various build and
> deploy issues.

OK.  I'll do the cleanup.   I have tested this atop the mempolicy
version by working around the build issues that I thought were just
temporary glitches in the mmotm series.  In my [limited] experience, one
can interleave numactl+hugeadm with setting values via the per node
attributes and it does the right thing.  No heavy testing with racing
tasks, tho'.

Lee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
  2009-08-25  8:16     ` David Rientjes
@ 2009-08-25 20:49       ` Lee Schermerhorn
  -1 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-25 20:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-08-25 at 01:16 -0700, David Rientjes wrote:
> On Mon, 24 Aug 2009, Lee Schermerhorn wrote:
> 
> > [PATCH 2/4] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
> > 
> > Against: 2.6.31-rc6-mmotm-090820-1918
> > 
> > V3:
> > + moved this patch to after the "rework" of hstate_next_node_to_...
> >   functions as this patch is more specific to using task mempolicy
> >   to control huge page allocation and freeing.
> > 
> > In preparation for constraining huge page allocation and freeing by the
> > controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
> > to the allocate, free and surplus adjustment functions.  For now, pass
> > NULL to indicate default behavior--i.e., use node_online_map.  A
> > subsequent patch will derive a non-default mask from the controlling 
> > task's numa mempolicy.
> > 
> > Reviewed-by: Mel Gorman <mel@csn.ul.ie>
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  mm/hugetlb.c |  102 ++++++++++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 67 insertions(+), 35 deletions(-)
> > 
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:46.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag
> >  }
> >  
> >  /*
> > - * common helper function for hstate_next_node_to_{alloc|free}.
> > - * return next node in node_online_map, wrapping at end.
> > + * common helper functions for hstate_next_node_to_{alloc|free}.
> > + * We may have allocated or freed a huge page based on a different
> > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might
> > + * be outside of *nodes_allowed.  Ensure that we use the next
> > + * allowed node for alloc or free.
> >   */
> > -static int next_node_allowed(int nid)
> > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
> >  {
> > -	nid = next_node(nid, node_online_map);
> > +	nid = next_node(nid, *nodes_allowed);
> >  	if (nid == MAX_NUMNODES)
> > -		nid = first_node(node_online_map);
> > +		nid = first_node(*nodes_allowed);
> >  	VM_BUG_ON(nid >= MAX_NUMNODES);
> >  
> >  	return nid;
> >  }
> >  
> > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
> > +{
> > +	if (!node_isset(nid, *nodes_allowed))
> > +		nid = next_node_allowed(nid, nodes_allowed);
> > +	return nid;
> > +}
> 
> Awkward name considering this doesn't simply return true or false as 
> expected, it returns a nid.

Well, it's not a predicate function, so I wouldn't expect a true or false
return, but I can see how the trailing "allowed" can sound like we're
asking the question "Is this node allowed?".  Maybe
"get_this_node_allowed()" or "get_start_node_allowed()" [we return the nid
to "start_nid"]?  Or do you have a suggestion?

> 
> > +
> >  /*
> >   * Use a helper variable to find the next node and then
> >   * copy it back to next_nid_to_alloc afterwards:
> > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid)
> >   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> >   * But we don't need to use a spin_lock here: it really
> >   * doesn't matter if occasionally a racer chooses the
> > - * same nid as we do.  Move nid forward in the mask even
> > - * if we just successfully allocated a hugepage so that
> > - * the next caller gets hugepages on the next node.
> > + * same nid as we do.  Move nid forward in the mask whether
> > + * or not we just successfully allocated a hugepage so that
> > + * the next allocation addresses the next node.
> >   */
> > -static int hstate_next_node_to_alloc(struct hstate *h)
> > +static int hstate_next_node_to_alloc(struct hstate *h,
> > +					nodemask_t *nodes_allowed)
> >  {
> >  	int nid, next_nid;
> >  
> > -	nid = h->next_nid_to_alloc;
> > -	next_nid = next_node_allowed(nid);
> > +	if (!nodes_allowed)
> > +		nodes_allowed = &node_online_map;
> > +
> > +	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> > +
> > +	next_nid = next_node_allowed(nid, nodes_allowed);
> >  	h->next_nid_to_alloc = next_nid;
> > +
> >  	return nid;
> >  }
> 
> Don't need next_nid.

Well, the pre-existing comment block indicated that the use of the
apparently spurious next_nid variable is necessary to close a race.  Not
sure whether that comment still applies with this rework.  What do you
think?  

> 
> > -static int alloc_fresh_huge_page(struct hstate *h)
> > +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> >  {
> >  	struct page *page;
> >  	int start_nid;
> >  	int next_nid;
> >  	int ret = 0;
> >  
> > -	start_nid = hstate_next_node_to_alloc(h);
> > +	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> >  	next_nid = start_nid;
> >  
> >  	do {
> > @@ -672,7 +688,7 @@ static int alloc_fresh_huge_page(struct
> >  			ret = 1;
> >  			break;
> >  		}
> > -		next_nid = hstate_next_node_to_alloc(h);
> > +		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> >  	} while (next_nid != start_nid);
> >  
> >  	if (ret)
> > @@ -689,13 +705,18 @@ static int alloc_fresh_huge_page(struct
> >   * whether or not we find a free huge page to free so that the
> >   * next attempt to free addresses the next node.
> >   */
> > -static int hstate_next_node_to_free(struct hstate *h)
> > +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> >  {
> >  	int nid, next_nid;
> >  
> > -	nid = h->next_nid_to_free;
> > -	next_nid = next_node_allowed(nid);
> > +	if (!nodes_allowed)
> > +		nodes_allowed = &node_online_map;
> > +
> > +	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
> > +
> > +	next_nid = next_node_allowed(nid, nodes_allowed);
> >  	h->next_nid_to_free = next_nid;
> > +
> >  	return nid;
> >  }
> 
> Same.

Yes, and I modeled this on "next to alloc", with the extra next_nid for
the same reason.  Do we dare remove it?

Lee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-25 13:35     ` Mel Gorman
@ 2009-08-25 20:49       ` Lee Schermerhorn
  -1 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-25 20:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote:
> On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> > <SNIP>
> >
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
> > @@ -21,9 +21,12 @@
> >  
> >  #include <linux/sysdev.h>
> >  #include <linux/cpumask.h>
> > +#include <linux/hugetlb.h>
> >  
> 
> Is this header inclusion necessary? It does not appear to be required by
> the structure modification (which is iffy in itself as discussed in the
> earlier mail) and it breaks build on x86-64.

Hi, Mel:

I recall that it is necessary to build.  You can try w/o it.

> 
>  CC      arch/x86/kernel/setup_percpu.o
> In file included from include/linux/pagemap.h:10,
>                  from include/linux/mempolicy.h:62,
>                  from include/linux/hugetlb.h:8,
>                  from include/linux/node.h:24,
>                  from include/linux/cpu.h:23,
>                  from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5,
>                  from arch/x86/kernel/setup_percpu.c:19:
> include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration
> /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here
> include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration
> /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here
> include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration
> /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here
> make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1
> make[1]: *** [arch/x86/kernel] Error 2


I saw this.  I've been testing on x86_64.  I *thought* that it only
started showing up in a recent mmotm from changes in the linux-next
patch--e.g., a failure to set ARCH_HAS_KMAP or to handle !ARCH_HAS_KMAP
appropriately in highmem.h.  But maybe that was coincidental with my
adding the include.


Lee

> 
> 
> 
> >  struct node {
> >  	struct sys_device	sysdev;
> > +	struct kobject		*hugepages_kobj;
> > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> >  };
> >  
> >  struct memory_block;
> > 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
  2009-08-25 20:49       ` Lee Schermerhorn
@ 2009-08-25 21:59         ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2009-08-25 21:59 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 25 Aug 2009, Lee Schermerhorn wrote:

> > > @@ -622,19 +622,29 @@ static struct page *alloc_fresh_huge_pag
> > >  }
> > >  
> > >  /*
> > > - * common helper function for hstate_next_node_to_{alloc|free}.
> > > - * return next node in node_online_map, wrapping at end.
> > > + * common helper functions for hstate_next_node_to_{alloc|free}.
> > > + * We may have allocated or freed a huge page based on a different
> > > + * nodes_allowed, previously, so h->next_node_to_{alloc|free} might
> > > + * be outside of *nodes_allowed.  Ensure that we use the next
> > > + * allowed node for alloc or free.
> > >   */
> > > -static int next_node_allowed(int nid)
> > > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
> > >  {
> > > -	nid = next_node(nid, node_online_map);
> > > +	nid = next_node(nid, *nodes_allowed);
> > >  	if (nid == MAX_NUMNODES)
> > > -		nid = first_node(node_online_map);
> > > +		nid = first_node(*nodes_allowed);
> > >  	VM_BUG_ON(nid >= MAX_NUMNODES);
> > >  
> > >  	return nid;
> > >  }
> > >  
> > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
> > > +{
> > > +	if (!node_isset(nid, *nodes_allowed))
> > > +		nid = next_node_allowed(nid, nodes_allowed);
> > > +	return nid;
> > > +}
> > 
> > Awkward name considering this doesn't simply return true or false as 
> > expected, it returns a nid.
> 
> Well, it's not a predicate function so I wouldn't expect true or false
> return, but I can see how the trailing "allowed" can sound like we're
> asking the question "Is this node allowed?".  Maybe,
> "get_this_node_allowed()" or "get_start_node_allowed" [we return the nid
> to "startnid"], ...  Or, do you have a suggestion?  
> 

this_node_allowed() just seemed like a very similar name to 
cpuset_zone_allowed() in the cpuset code, which does return true or false 
depending on whether the zone is allowed by current's cpuset.  As usual 
with the mempolicy discussions, I come from a biased cpuset perspective :)

> > 
> > > +
> > >  /*
> > >   * Use a helper variable to find the next node and then
> > >   * copy it back to next_nid_to_alloc afterwards:
> > > @@ -642,28 +652,34 @@ static int next_node_allowed(int nid)
> > >   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> > >   * But we don't need to use a spin_lock here: it really
> > >   * doesn't matter if occasionally a racer chooses the
> > > - * same nid as we do.  Move nid forward in the mask even
> > > - * if we just successfully allocated a hugepage so that
> > > - * the next caller gets hugepages on the next node.
> > > + * same nid as we do.  Move nid forward in the mask whether
> > > + * or not we just successfully allocated a hugepage so that
> > > + * the next allocation addresses the next node.
> > >   */
> > > -static int hstate_next_node_to_alloc(struct hstate *h)
> > > +static int hstate_next_node_to_alloc(struct hstate *h,
> > > +					nodemask_t *nodes_allowed)
> > >  {
> > >  	int nid, next_nid;
> > >  
> > > -	nid = h->next_nid_to_alloc;
> > > -	next_nid = next_node_allowed(nid);
> > > +	if (!nodes_allowed)
> > > +		nodes_allowed = &node_online_map;
> > > +
> > > +	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> > > +
> > > +	next_nid = next_node_allowed(nid, nodes_allowed);
> > >  	h->next_nid_to_alloc = next_nid;
> > > +
> > >  	return nid;
> > >  }
> > 
> > Don't need next_nid.
> 
> Well, the pre-existing comment block indicated that the use of the
> apparently spurious next_nid variable is necessary to close a race.  Not
> sure whether that comment still applies with this rework.  What do you
> think?  
> 

What race is it closing exactly if gcc is going to optimize it out 
anyway?  I think you can safely fold the following into your patch.
---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -659,15 +659,14 @@ static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
 static int hstate_next_node_to_alloc(struct hstate *h,
 					nodemask_t *nodes_allowed)
 {
-	int nid, next_nid;
+	int nid;
 
 	if (!nodes_allowed)
 		nodes_allowed = &node_online_map;
 
 	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
 
-	next_nid = next_node_allowed(nid, nodes_allowed);
-	h->next_nid_to_alloc = next_nid;
+	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
 
 	return nid;
 }
@@ -707,15 +706,14 @@ static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
  */
 static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 {
-	int nid, next_nid;
+	int nid;
 
 	if (!nodes_allowed)
 		nodes_allowed = &node_online_map;
 
 	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
 
-	next_nid = next_node_allowed(nid, nodes_allowed);
-	h->next_nid_to_free = next_nid;
+	h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
 
 	return nid;
 }

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] hugetlb:  add nodemask arg to huge page alloc, free and surplus adjust fcns
  2009-08-25 20:49       ` Lee Schermerhorn
@ 2009-08-26  9:58         ` Mel Gorman
  -1 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-26  9:58 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: David Rientjes, linux-mm, linux-numa, akpm, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Aug 25, 2009 at 04:49:34PM -0400, Lee Schermerhorn wrote:
> > > <SNIP>
> > > +static int hstate_next_node_to_alloc(struct hstate *h,
> > > +					nodemask_t *nodes_allowed)
> > >  {
> > >  	int nid, next_nid;
> > >  
> > > -	nid = h->next_nid_to_alloc;
> > > -	next_nid = next_node_allowed(nid);
> > > +	if (!nodes_allowed)
> > > +		nodes_allowed = &node_online_map;
> > > +
> > > +	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> > > +
> > > +	next_nid = next_node_allowed(nid, nodes_allowed);
> > >  	h->next_nid_to_alloc = next_nid;
> > > +
> > >  	return nid;
> > >  }
> > 
> > Don't need next_nid.
> 
> Well, the pre-existing comment block indicated that the use of the
> apparently spurious next_nid variable is necessary to close a race.  Not
> sure whether that comment still applies with this rework.  What do you
> think?  
> 

The original intention was not to return h->next_nid_to_alloc because
there is a race window where it's MAX_NUMNODES.

nid is a stack-local variable here; it should not become MAX_NUMNODES by
accident because this_node_allowed() and next_node_allowed() both take
care not to return MAX_NUMNODES, so it's safe as a return value even in
the presence of races with the code structure you currently have. I think
it's safe to have

nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);

return nid;

because at worst, in the presence of races, h->next_nid_to_alloc gets
assigned the same value twice, but never MAX_NUMNODES.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-25 20:49       ` Lee Schermerhorn
@ 2009-08-26 10:11         ` Mel Gorman
  -1 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-26 10:11 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote:
> > > 
> > > +static nodemask_t *nodes_allowed_from_node(int nid)
> > > +{
> > 
> > This name is a bit weird. It's creating a nodemask with just a single
> > node allowed.
> > 
> > Is there something wrong with using the existing function
> > nodemask_of_node()? If stack is the problem, perhaps there is some macro
> > magic that would allow a nodemask to be either declared on the stack or
> > kmalloc'd.
> 
> Yeah.  nodemask_of_node() creates an on-stack mask, invisibly, in a
> block nested inside the context where it's invoked.  I would be
> declaring the nodemask in the compound else clause and don't want to
> access it [via the nodes_allowed pointer] from outside of there.
> 

So, the existence of the mask on the stack is the problem. I can
understand that; they are potentially quite large.

Would it be possible to add a helper along side it like
init_nodemask_of_node() that does the same work as nodemask_of_node()
but takes a nodemask parameter? nodemask_of_node() would reuse the
init_nodemask_of_node() except it declares the nodemask on the stack.
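
A rough sketch of that suggestion (assuming the _unused_nodemask_arg_
dummy already used by the nodemask.h macros; a real version might also
want a fast path for small masks):

	/* Initialize a caller-supplied mask to contain just 'node'. */
	static inline void init_nodemask_of_node(nodemask_t *mask, int node)
	{
		nodes_clear(*mask);
		node_set(node, *mask);
	}

	/* On-stack variant, reusing the helper above. */
	#define nodemask_of_node(node)				\
	({							\
		typeof(_unused_nodemask_arg_) m;		\
		init_nodemask_of_node(&m, (node));		\
		m;						\
	})

hugetlb could then kmalloc() a nodemask and initialize it with
init_nodemask_of_node(), avoiding the large on-stack copy.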

> > 
> > > +	nodemask_t *nodes_allowed;
> > > +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> > > +	if (!nodes_allowed) {
> > > +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> > > +			"for huge page allocation.\nFalling back to default.\n",
> > > +			current->comm);
> > > +	} else {
> > > +		nodes_clear(*nodes_allowed);
> > > +		node_set(nid, *nodes_allowed);
> > > +	}
> > > +	return nodes_allowed;
> > > +}
> > > +
> > >  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > > +								int nid)
> > >  {
> > >  	unsigned long min_count, ret;
> > >  	nodemask_t *nodes_allowed;
> > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
> > >  	if (h->order >= MAX_ORDER)
> > >  		return h->max_huge_pages;
> > >  
> > > -	nodes_allowed = huge_mpol_nodes_allowed();
> > > +	if (nid < 0)
> > > +		nodes_allowed = huge_mpol_nodes_allowed();
> > 
> > hugetlb is a bit littered with magic numbers been passed into functions.
> > Attempts have been made to clear them up as according as patches change
> > that area. Would it be possible to define something like
> > 
> > #define HUGETLB_OBEY_MEMPOLICY -1
> > 
> > for the nid here as opposed to passing in -1? I know -1 is used in the page
> > allocator functions but there it means "current node" and here it means
> > "obey mempolicies".
> 
Well, here it means NO_NODE_ID_SPECIFIED or, "we didn't get here via a
per node attribute".  It means "derive nodes allowed from memory policy,
if non-default, else use node_online_map" [which is not exactly the
same as obeying memory policy].

But I can see defining a symbolic constant such as
NO_NODE[_ID_SPECIFIED].  I'll try that in the next spin.
> 

That NO_NODE_ID_SPECIFIED was the underlying definition I was looking
for. It makes sense at both sites.
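
For concreteness, a minimal sketch of the constant and its call sites
(name per the discussion above; final spelling and placement are up to
the respin):

	/* nid value meaning "no per node attribute was involved" */
	#define NO_NODE_ID_SPECIFIED	(-1)

so that the sysctl path reads

	h->max_huge_pages = set_max_huge_pages(h, tmp, NO_NODE_ID_SPECIFIED);

and kobj_to_hstate() sets *nidp = NO_NODE_ID_SPECIFIED for the global,
non-per-node attributes.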

> > > -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > > +{
> > > +	int nid;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > +		struct node *node = &node_devices[nid];
> > > +		int hi;
> > > +		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
> > 
> > Does that hi mean hello, high, nid or hstate_idx?
> > 
> > hstate_idx would appear to be the appropriate name here.
> 
> Or just plain 'i', like in the following, pre-existing function?
> 

Whichever suits you best. If hstate_idx is really what it is, I see no
harm in using it but 'i' is an index and I'd sooner recognise that than
the less meaningful "hi".

> > 
> > > +			if (node->hstate_kobjs[hi] == kobj) {
> > > +				if (nidp)
> > > +					*nidp = nid;
> > > +				return &hstates[hi];
> > > +			}
> > > +	}
> > 
> > Ok.... so, there is a struct node array for the sysdev and this patch adds
> > references to the "hugepages" directory kobject and the subdirectories for
> > each page size. We walk all the objects until we find a match. Obviously,
> > this adds a dependency of base node support on hugetlbfs which feels backwards
> > and you call that out in your leader.
> > 
> > Can this be the other way around? i.e. The struct hstate has an array of
> > kobjects arranged by nid that is filled in when the node is registered?
> > There will only be one kobject-per-pagesize-per-node so it seems like it
> > would work. I confess, I haven't prototyped this to be 100% sure.
> 
This will take a bit longer to sort out.  I do want to change the
registration, tho', so that hugetlb.c registers its single node
register/unregister functions with base/node.c to remove the source
level dependency in that direction.  node.c will only register nodes on
hot plug, as it's initialized too early, relative to hugetlb.c, to
register them at init time.  This should break the call dependency of
base/node.c on the hugetlb module.
> 
> As far as moving the per node attributes' kobjects to the hugetlb global
> hstate arrays...  Have to think about that.  I agree that it would be
> nice to remove the source level [header] dependency.
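
To make that concrete, one possible shape for the inverted arrangement
(illustrative only; the struct and array names are invented here):

	/*
	 * Keep the per node kobject pointers in hugetlb.c, indexed
	 * by nid, instead of embedding them in struct node.
	 */
	struct node_hstate {
		struct kobject	*hugepages_kobj;
		struct kobject	*hstate_kobjs[HUGE_MAX_HSTATE];
	};
	static struct node_hstate node_hstates[MAX_NUMNODES];

kobj_to_node_hstate() would then walk node_hstates[] instead of
node_devices[], and include/linux/node.h would no longer need to pull
in linux/hugetlb.h at all.
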
> 

FWIW, I see no problem with the mempolicy stuff going ahead separately from
this patch after the few relatively minor cleanups highlighted in the thread
and tackling this patch as a separate cycle. It's up to you really.

> > 
> > > +
> > > +	BUG();
> > > +	return NULL;
> > > +}
> > > +
> > > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
> > >  {
> > >  	int i;
> > > +
> > >  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > > -		if (hstate_kobjs[i] == kobj)
> > > +		if (hstate_kobjs[i] == kobj) {
> > > +			if (nidp)
> > > +				*nidp = -1;
> > >  			return &hstates[i];
> > > -	BUG();
> > > -	return NULL;
> > > +		}
> > > +
> > > +	return kobj_to_node_hstate(kobj, nidp);
> > >  }
> > >  
> > >  static ssize_t nr_hugepages_show(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> > > +	struct hstate *h;
> > > +	unsigned long nr_huge_pages;
> > > +	int nid;
> > > +
> > > +	h = kobj_to_hstate(kobj, &nid);
> > > +	if (nid < 0)
> > > +		nr_huge_pages = h->nr_huge_pages;
> > 
> > Here is another magic number except it means something slightly
> > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would
> > be nice if these different special nid values could be named, preferably
> > collapsed to being one "core" thing.
> 
> Again, it means "NO NODE ID specified" [via per node attribute].  Again,
> I'll address this with a single constant.
> 
> > 
> > > +	else
> > > +		nr_huge_pages = h->nr_huge_pages_node[nid];
> > > +
> > > +	return sprintf(buf, "%lu\n", nr_huge_pages);
> > >  }
> > > +
> > >  static ssize_t nr_hugepages_store(struct kobject *kobj,
> > >  		struct kobj_attribute *attr, const char *buf, size_t count)
> > >  {
> > > -	int err;
> > >  	unsigned long input;
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > +	struct hstate *h;
> > > +	int nid;
> > > +	int err;
> > >  
> > >  	err = strict_strtoul(buf, 10, &input);
> > >  	if (err)
> > >  		return 0;
> > >  
> > > -	h->max_huge_pages = set_max_huge_pages(h, input);
> > 
> > "input" is a bit meaningless. The function you are passing to calls this
> > parameter "count". Can you match the naming please? Otherwise, I might
> > guess that this is a "delta" which occurs elsewhere in the hugetlb code.
> 
> I guess I can change that.  It's the pre-existing name, and 'count' was
> already used.  Guess I can change 'count' to 'len' and 'input' to
> 'count'

Makes sense.

> > 
> > > +	h = kobj_to_hstate(kobj, &nid);
> > > +	h->max_huge_pages = set_max_huge_pages(h, input, nid);
> > >  
> > >  	return count;
> > >  }
> > > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
> > >  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> > > +
> > >  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
> > >  }
> > > +
> > >  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
> > >  		struct kobj_attribute *attr, const char *buf, size_t count)
> > >  {
> > >  	int err;
> > >  	unsigned long input;
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> > >  
> > >  	err = strict_strtoul(buf, 10, &input);
> > >  	if (err)
> > > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
> > >  static ssize_t free_hugepages_show(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> > > +	struct hstate *h;
> > > +	unsigned long free_huge_pages;
> > > +	int nid;
> > > +
> > > +	h = kobj_to_hstate(kobj, &nid);
> > > +	if (nid < 0)
> > > +		free_huge_pages = h->free_huge_pages;
> > > +	else
> > > +		free_huge_pages = h->free_huge_pages_node[nid];
> > > +
> > > +	return sprintf(buf, "%lu\n", free_huge_pages);
> > >  }
> > >  HSTATE_ATTR_RO(free_hugepages);
> > >  
> > >  static ssize_t resv_hugepages_show(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> > >  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
> > >  }
> > >  HSTATE_ATTR_RO(resv_hugepages);
> > > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
> > >  static ssize_t surplus_hugepages_show(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> > > +	struct hstate *h;
> > > +	unsigned long surplus_huge_pages;
> > > +	int nid;
> > > +
> > > +	h = kobj_to_hstate(kobj, &nid);
> > > +	if (nid < 0)
> > > +		surplus_huge_pages = h->surplus_huge_pages;
> > > +	else
> > > +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> > > +
> > > +	return sprintf(buf, "%lu\n", surplus_huge_pages);
> > >  }
> > >  HSTATE_ATTR_RO(surplus_hugepages);
> > >  
> > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
> > >  	.attrs = hstate_attrs,
> > >  };
> > >  
> > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> > > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> > > +				struct kobject *parent,
> > > +				struct kobject **hstate_kobjs,
> > > +				struct attribute_group *hstate_attr_group)
> > >  {
> > >  	int retval;
> > > +	int hi = h - hstates;
> > >  
> > > -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> > > -							hugepages_kobj);
> > > -	if (!hstate_kobjs[h - hstates])
> > > +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> > > +	if (!hstate_kobjs[hi])
> > >  		return -ENOMEM;
> > >  
> > > -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> > > -							&hstate_attr_group);
> > > +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
> > >  	if (retval)
> > > -		kobject_put(hstate_kobjs[h - hstates]);
> > > +		kobject_put(hstate_kobjs[hi]);
> > >  
> > >  	return retval;
> > >  }
> > > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
> > >  		return;
> > >  
> > >  	for_each_hstate(h) {
> > > -		err = hugetlb_sysfs_add_hstate(h);
> > > +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> > > +					 hstate_kobjs, &hstate_attr_group);
> > >  		if (err)
> > >  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> > >  								h->name);
> > >  	}
> > >  }
> > >  
> > > +#ifdef CONFIG_NUMA
> > > +static struct attribute *per_node_hstate_attrs[] = {
> > > +	&nr_hugepages_attr.attr,
> > > +	&free_hugepages_attr.attr,
> > > +	&surplus_hugepages_attr.attr,
> > > +	NULL,
> > > +};
> > > +
> > > +static struct attribute_group per_node_hstate_attr_group = {
> > > +	.attrs = per_node_hstate_attrs,
> > > +};
> > > +
> > > +
> > > +void hugetlb_unregister_node(struct node *node)
> > > +{
> > > +	struct hstate *h;
> > > +
> > > +	for_each_hstate(h) {
> > > +		kobject_put(node->hstate_kobjs[h - hstates]);
> > > +		node->hstate_kobjs[h - hstates] = NULL;
> > > +	}
> > > +
> > > +	kobject_put(node->hugepages_kobj);
> > > +	node->hugepages_kobj = NULL;
> > > +}
> > > +
> > > +static void hugetlb_unregister_all_nodes(void)
> > > +{
> > > +	int nid;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++)
> > > +		hugetlb_unregister_node(&node_devices[nid]);
> > > +}
> > > +
> > > +void hugetlb_register_node(struct node *node)
> > > +{
> > > +	struct hstate *h;
> > > +	int err;
> > > +
> > > +	if (!hugepages_kobj)
> > > +		return;		/* too early */
> > > +
> > > +	node->hugepages_kobj = kobject_create_and_add("hugepages",
> > > +							&node->sysdev.kobj);
> > > +	if (!node->hugepages_kobj)
> > > +		return;
> > > +
> > > +	for_each_hstate(h) {
> > > +		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
> > > +						node->hstate_kobjs,
> > > +						&per_node_hstate_attr_group);
> > > +		if (err)
> > > +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> > > +					" for node %d\n",
> > > +						h->name, node->sysdev.id);
> > > +	}
> > > +}
> > > +
> > > +static void hugetlb_register_all_nodes(void)
> > > +{
> > > +	int nid;
> > > +
> > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > +		struct node *node = &node_devices[nid];
> > > +		if (node->sysdev.id == nid && !node->hugepages_kobj)
> > > +			hugetlb_register_node(node);
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >  static void __exit hugetlb_exit(void)
> > >  {
> > >  	struct hstate *h;
> > >  
> > > +	hugetlb_unregister_all_nodes();
> > > +
> > >  	for_each_hstate(h) {
> > >  		kobject_put(hstate_kobjs[h - hstates]);
> > >  	}
> > > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
> > >  
> > >  	hugetlb_sysfs_init();
> > >  
> > > +	hugetlb_register_all_nodes();
> > > +
> > >  	return 0;
> > >  }
> > >  module_init(hugetlb_init);
> > > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> > >  	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > >  
> > >  	if (write)
> > > -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> > > +		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
> > >  
> > >  	return 0;
> > >  }
> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
> > > @@ -21,9 +21,12 @@
> > >  
> > >  #include <linux/sysdev.h>
> > >  #include <linux/cpumask.h>
> > > +#include <linux/hugetlb.h>
> > >  
> > >  struct node {
> > >  	struct sys_device	sysdev;
> > > +	struct kobject		*hugepages_kobj;
> > > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> > >  };
> > >  
> > >  struct memory_block;
> > > 
> > 
> > I'm not against this idea and think it can work side-by-side with the memory
> > policies. I believe it does need a bit more cleaning up before merging
> > though. I also wasn't able to test this yet due to various build and
> > deploy issues.
> 
> OK.  I'll do the cleanup.   I have tested this atop the mempolicy
> version by working around the build issues that I thought were just
> temporary glitches in the mmotm series.  In my [limited] experience, one
> can interleave numactl+hugeadm with setting values via the per node
> attributes and it does the right thing.  No heavy testing with racing
> tasks, tho'.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-25 20:49       ` Lee Schermerhorn
@ 2009-08-26 10:12         ` Mel Gorman
  -1 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-26 10:12 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Aug 25, 2009 at 04:49:40PM -0400, Lee Schermerhorn wrote:
> On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote:
> > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> > > <SNIP>
> > >
> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-24 12:12:44.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-24 12:12:56.000000000 -0400
> > > @@ -21,9 +21,12 @@
> > >  
> > >  #include <linux/sysdev.h>
> > >  #include <linux/cpumask.h>
> > > +#include <linux/hugetlb.h>
> > >  
> > 
> > Is this header inclusion necessary? It does not appear to be required by
> > the structure modification (which is iffy in itself as discussed in the
> > earlier mail) and it breaks build on x86-64.
> 
> Hi, Mel:
> 
> I recall that it is necessary to build.  You can try w/o it.
> 

I did; it appeared to work, but I didn't dig deep into why.

> > 
> >  CC      arch/x86/kernel/setup_percpu.o
> > In file included from include/linux/pagemap.h:10,
> >                  from include/linux/mempolicy.h:62,
> >                  from include/linux/hugetlb.h:8,
> >                  from include/linux/node.h:24,
> >                  from include/linux/cpu.h:23,
> >                  from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5,
> >                  from arch/x86/kernel/setup_percpu.c:19:
> > include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration
> > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here
> > include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration
> > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here
> > include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration
> > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here
> > make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1
> > make[1]: *** [arch/x86/kernel] Error 2
> 
> I saw this.  I've been testing on x86_64.  I *thought* that it only
> started showing up in a recent mmotm from changes in the linux-next
> patch--e.g., a failure to set ARCH_HAS_KMAP or to handle !ARCH_HAS_KMAP
> appropriately in highmem.h.  But maybe that was coincidental with my
> adding the include.
> 

Maybe we were looking at different mmotm's.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 10:11         ` Mel Gorman
@ 2009-08-26 18:02           ` Lee Schermerhorn
  -1 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-26 18:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Wed, 2009-08-26 at 11:11 +0100, Mel Gorman wrote:
> On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote:
> > > > 
> > > > +static nodemask_t *nodes_allowed_from_node(int nid)
> > > > +{
> > > 
> > > This name is a bit weird. It's creating a nodemask with just a single
> > > node allowed.
> > > 
> > > Is there something wrong with using the existing function
> > > nodemask_of_node()? If stack is the problem, perhaps there is some macro
> > > magic that would allow a nodemask to be either declared on the stack or
> > > kmalloc'd.
> > 
> > Yeah.  nodemask_of_node() creates an on-stack mask, invisibly, in a
> > block nested inside the context where it's invoked.  I would be
> > declaring the nodemask in the compound else clause and don't want to
> > access it [via the nodes_allowed pointer] from outside of there.
> > 
> 
So, the existence of the mask on the stack is the problem. I can
> understand that, they are potentially quite large.
> 
> Would it be possible to add a helper along side it like
> init_nodemask_of_node() that does the same work as nodemask_of_node()
> but takes a nodemask parameter? nodemask_of_node() would reuse the
> init_nodemask_of_node() except it declares the nodemask on the stack.
> 

<snip>

Here's the patch that introduces the helper function that I propose.
I'll send an update of the subject patch that uses this macro and, I
think, addresses your other issues via a separate message.  This patch
applies just before the "register per node attributes" patch.  Once we
can agree on these [or subsequent] changes, I'll repost the entire
updated series.

Lee

---

PATCH 4/6 - hugetlb:  introduce alloc_nodemask_of_node()

Against: 2.6.31-rc6-mmotm-090820-1918

Introduce a nodemask macro to allocate a nodemask and
initialize it to contain a single node, using the existing
nodemask_of_node() macro.  It is coded as a macro to avoid
header dependency hell.

This will be used to construct the huge pages "nodes_allowed"
nodemask for a single node when a persistent huge page
pool page count is modified via a per node sysfs attribute.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/nodemask.h |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
@@ -257,6 +257,16 @@ static inline int __next_node(int n, con
 	m;								\
 })
 
+#define alloc_nodemask_of_node(node)					\
+({									\
+	typeof(_unused_nodemask_arg_) *nmp;				\
+	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
+	if (nmp)							\
+		*nmp = nodemask_of_node(node);				\
+	nmp;								\
+})
+
+
 #define first_unset_node(mask) __first_unset_node(&(mask))
 static inline int __first_unset_node(const nodemask_t *maskp)
 {
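
For illustration (not part of the patch), a caller of the new macro
would look something like the sketch below; the NULL check and the
kfree() are the caller's responsibility:

	nodemask_t *nodes_allowed = alloc_nodemask_of_node(nid);
	if (nodes_allowed) {
		/* alloc/free persistent huge pages only on 'nid' */
		kfree(nodes_allowed);
	} else {
		/* fall back to the default: node_online_map */
	}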


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 10:11         ` Mel Gorman
  (?)
  (?)
@ 2009-08-26 18:04         ` Lee Schermerhorn
  2009-08-27 10:23           ` Mel Gorman
  -1 siblings, 1 reply; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-26 18:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

Proposed revised patch attached.  Some comments in-line...

On Wed, 2009-08-26 at 11:11 +0100, Mel Gorman wrote:
> On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote:
> > > > 
> > > > +static nodemask_t *nodes_allowed_from_node(int nid)
> > > > +{
> > > 
> > > This name is a bit weird. It's creating a nodemask with just a single
> > > node allowed.
> > > 
> > > Is there something wrong with using the existing function
> > > nodemask_of_node()? If stack is the problem, perhaps there is some macro
> > > magic that would allow a nodemask to be either declared on the stack or
> > > kmalloc'd.
> > 
> > Yeah.  nodemask_of_node() creates an on-stack mask, invisibly, in a
> > block nested inside the context where it's invoked.  I would be
> > declaring the nodemask in the compound else clause and don't want to
> > access it [via the nodes_allowed pointer] from outside of there.
> > 
> 
> So, the existence of the mask on the stack is the problem. I can
> understand that, they are potentially quite large.
> 
> Would it be possible to add a helper along side it like
> init_nodemask_of_node() that does the same work as nodemask_of_node()
> but takes a nodemask parameter? nodemask_of_node() would reuse the
> init_nodemask_of_node() except it declares the nodemask on the stack.

Now use "alloc_nodemask_of_node()" to alloc/init a nodemask with a
single node.  
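
A sketch of the helper Mel suggests above (hypothetical; the existing
nodemask_of_node() macro could then be rewritten in terms of it):

	static inline void init_nodemask_of_node(nodemask_t *mask, int node)
	{
		nodes_clear(*mask);
		node_set(node, *mask);
	}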

<snip>

> > > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > > > +								int nid)
> > > >  {
> > > >  	unsigned long min_count, ret;
> > > >  	nodemask_t *nodes_allowed;
> > > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
> > > >  	if (h->order >= MAX_ORDER)
> > > >  		return h->max_huge_pages;
> > > >  
> > > > -	nodes_allowed = huge_mpol_nodes_allowed();
> > > > +	if (nid < 0)
> > > > +		nodes_allowed = huge_mpol_nodes_allowed();
> > > 
> > > hugetlb is a bit littered with magic numbers been passed into functions.
> > > Attempts have been made to clear them up as according as patches change
> > > that area. Would it be possible to define something like
> > > 
> > > #define HUGETLB_OBEY_MEMPOLICY -1
> > > 
> > > for the nid here as opposed to passing in -1? I know -1 is used in the page
> > > allocator functions but there it means "current node" and here it means
> > > "obey mempolicies".
> > 
> > Well, here it means, NO_NODE_ID_SPECIFIED or, "we didn't get here via a
> > per node attribute".  It means "derive nodes allowed from memory policy,
> > if non-default, else use nodes_online_map" [which is not exactly the
> > same as obeying memory policy].
> > 
> > But, I can see defining a symbolic constant such as
> > NO_NODE[_ID_SPECIFIED].  I'll try next spin.
> > 
> 
> That NO_NODE_ID_SPECIFIED was the underlying definition I was looking
> for. It makes sense at both sites.

Done.

> 
> > > > -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> > > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > > > +{
> > > > +	int nid;
> > > > +
> > > > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > > > +		struct node *node = &node_devices[nid];
> > > > +		int hi;
> > > > +		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
> > > 
> > > Does that hi mean hello, high, nid or hstate_idx?
> > > 
> > > hstate_idx would appear to be the appropriate name here.
> > 
> > Or just plain 'i', like in the following, pre-existing function?
> > 
> 
> Whichever suits you best. If hstate_idx is really what it is, I see no
> harm in using it but 'i' is an index and I'd sooner recognise that than
> the less meaningful "hi".

Changed to 'i'

<snip>

> > > 
> > > Ok.... so, there is a struct node array for the sysdev and this patch adds
> > > references to the "hugepages" directory kobject and the subdirectories for
> > > each page size. We walk all the objects until we find a match. Obviously,
> > > this adds a dependency of base node support on hugetlbfs which feels backwards
> > > and you call that out in your leader.
> > > 
> > > Can this be the other way around? i.e. The struct hstate has an array of
> > > kobjects arranged by nid that is filled in when the node is registered?
> > > There will only be one kobject-per-pagesize-per-node so it seems like it
> > > would work. I confess, I haven't prototyped this to be 100% sure.
> > 
> > This will take a bit longer to sort out.  I do want to change the
> > registration, tho', so that hugetlb.c registers it's single node
> > register/unregister functions with base/node.c to remove the source
> > level dependency in that direction.  node.c will only register nodes on
> > hot plug as it's initialized too early, relative to hugetlb.c to
> > register them at init time.   This should break the call dependency of
> > base/node.c on the hugetlb module.
> > 
> > As far as moving the per node attributes' kobjects to the hugetlb global
> > hstate arrays...  Have to think about that.  I agree that it would be
> > nice to remove the source level [header] dependency.
> > 
> 
> FWIW, I see no problem with the mempolicy stuff going ahead separately from
> this patch after the few relatively minor cleanups highlighted in the thread
> and tackling this patch as a separate cycle. It's up to you really.

I took a look at it and propose the attached rework.  I moved all of the
per node, per hstate kobj pointers to hugetlb.c.  hugetlb.c now registers
its single node register/unregister functions with base/node.c to
support hot-plug.  If hugetlbfs never registers with node.c, node.c will
never try to register the per node attributes.  This patch applies atop
the "introduce alloc_nodemask_of_node()" patch I sent earlier.  Let me
know what you think.
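
The central data structure change, as it appears in the patch below:
the per node kobject pointers move out of struct node into a hugetlb.c
private array indexed by node id:

	struct node_hstate {
		struct kobject		*hugepages_kobj;
		struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
	};
	struct node_hstate node_hstates[MAX_NUMNODES];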

<snip>
> > > >  
> > > >  static ssize_t nr_hugepages_show(struct kobject *kobj,
> > > >  					struct kobj_attribute *attr, char *buf)
> > > >  {
> > > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > > -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> > > > +	struct hstate *h;
> > > > +	unsigned long nr_huge_pages;
> > > > +	int nid;
> > > > +
> > > > +	h = kobj_to_hstate(kobj, &nid);
> > > > +	if (nid < 0)
> > > > +		nr_huge_pages = h->nr_huge_pages;
> > > 
> > > Here is another magic number except it means something slightly
> > > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would
> > > be nice if these different special nid values could be named, preferably
> > > collapsed to being one "core" thing.
> > 
> > Again, it means "NO NODE ID specified" [via per node attribute].  Again,
> > I'll address this with a single constant.

Fixed.

> > 
> > > 
> > > > +	else
> > > > +		nr_huge_pages = h->nr_huge_pages_node[nid];
> > > > +
> > > > +	return sprintf(buf, "%lu\n", nr_huge_pages);
> > > >  }
> > > > +
> > > >  static ssize_t nr_hugepages_store(struct kobject *kobj,
> > > >  		struct kobj_attribute *attr, const char *buf, size_t count)
> > > >  {
> > > > -	int err;
> > > >  	unsigned long input;
> > > > -	struct hstate *h = kobj_to_hstate(kobj);
> > > > +	struct hstate *h;
> > > > +	int nid;
> > > > +	int err;
> > > >  
> > > >  	err = strict_strtoul(buf, 10, &input);
> > > >  	if (err)
> > > >  		return 0;
> > > >  
> > > > -	h->max_huge_pages = set_max_huge_pages(h, input);
> > > 
> > > "input" is a bit meaningless. The function you are passing to calls this
> > > parameter "count". Can you match the naming please? Otherwise, I might
> > > guess that this is a "delta" which occurs elsewhere in the hugetlb code.
> > 
> > I guess I can change that.  It's the pre-existing name, and 'count' was
> > already used.  Guess I can change 'count' to 'len' and 'input' to
> > 'count'
> 
> Makes sense.

fixed.

> 
> > > 
> > > > +	h = kobj_to_hstate(kobj, &nid);
> > > > +	h->max_huge_pages = set_max_huge_pages(h, input, nid);
> > > >  
> > > >  	return count;
> > > >  }

<snip>

> > > I'm not against this idea and think it can work side-by-side with the memory
> > > policies. I believe it does need a bit more cleaning up before merging
> > > though. I also wasn't able to test this yet due to various build and
> > > deploy issues.
> > 
> > OK.  I'll do the cleanup.   I have tested this atop the mempolicy
> > version by working around the build issues that I thought were just
> > temporary glitches in the mmotm series.  In my [limited] experience, one
> > can interleave numactl+hugeadm with setting values via the per node
> > attributes and it does the right thing.  No heavy testing with racing
> > tasks, tho'.
> > 

This revised patch also removes the include of hugetlb.h from node.h.

Lee

---

PATCH 5/6 hugetlb:  register per node hugepages attributes

Against: 2.6.31-rc6-mmotm-090820-1918

V2:  remove dependency on kobject private bitfield.  Search
     global hstates then all per node hstates for kobject
     match in attribute show/store functions.

V3:  rebase atop the mempolicy-based hugepage alloc/free;
     use custom "nodes_allowed" to restrict alloc/free to
     a specific node via per node attributes.  Per node
     attribute overrides mempolicy.  I.e., mempolicy only
     applies to global attributes.

V4:  Fix issues raised by Mel Gorman:
     + add !NUMA versions of hugetlb_[un]register_node()
     + rename 'hi' to 'i' in kobj_to_node_hstate()
     + rename (count, input) to (len, count) in nr_hugepages_store()
     + moved per node hugepages_kobj and hstate_kobjs[] from the
       struct node [sysdev] to hugetlb.c private arrays.
     + changed the registration mechanism so that hugetlbfs [a module]
       registers its attribute registration callbacks with the node
       driver, eliminating the dependency between the node driver
       and hugetlbfs.  From its init func, hugetlbfs will register
       all on-line nodes' hugepage sysfs attributes along with
       hugetlbfs' attribute register/unregister functions.  The
       node driver will use these functions to [un]register nodes
       with hugetlbfs on node hot-plug.
     + replaced hugetlb.c private "nodes_allowed_from_node()" with
       generic "alloc_nodemask_of_node()".

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_huge_pages    - r/o
	surplus_huge_pages - r/o

The patch attempts to re-use/share as much of the existing
global hstate attribute initialization and handling, and the
"nodes_allowed" constraint processing as possible.
Calling set_max_huge_pages() with no node indicates a change to
global hstate parameters.  In this case, any non-default task
mempolicy will be used to generate the nodes_allowed mask.  A
valid node id indicates an update to that node's hstate 
parameters, and the count argument specifies the target count
for the specified node.  From this info, we compute the target
global count for the hstate and construct a nodes_allowed node
mask containing only the specified node.
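
For example, with hypothetical counts: if the hstate has 24 huge pages
globally, 4 of them on node 2, then writing 16 to node 2's nr_hugepages
yields a global target of 16 + (24 - 4) = 36 and a nodes_allowed mask
of just node 2, so only node 2's pool grows from 4 to 16 pages.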

Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
	numactl -m 2 hugeadm --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c  |   27 +++++
 include/linux/node.h |    6 +
 include/linux/numa.h |    2 
 mm/hugetlb.c         |  245 ++++++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 250 insertions(+), 30 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c	2009-08-26 12:37:03.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c	2009-08-26 13:01:54.000000000 -0400
@@ -177,6 +177,31 @@ static ssize_t node_read_distance(struct
 }
 static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+/*
+ * hugetlbfs per node attributes registration interface
+ */
+NODE_REGISTRATION_FUNC __hugetlb_register_node;
+NODE_REGISTRATION_FUNC __hugetlb_unregister_node;
+
+static inline void hugetlb_register_node(struct node *node)
+{
+	if (__hugetlb_register_node)
+		__hugetlb_register_node(node);
+}
+
+static inline void hugetlb_unregister_node(struct node *node)
+{
+	if (__hugetlb_unregister_node)
+		__hugetlb_unregister_node(node);
+}
+
+void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
+                                  NODE_REGISTRATION_FUNC unregister)
+{
+	__hugetlb_register_node   = doregister;
+	__hugetlb_unregister_node = unregister;
+}
+
 
 /*
  * register_node - Setup a sysfs device for a node.
@@ -200,6 +225,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +246,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-26 12:37:04.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-26 13:01:54.000000000 -0400
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/node.h>
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs
 }
 
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+								int nid)
 {
 	unsigned long min_count, ret;
 	nodemask_t *nodes_allowed;
@@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	nodes_allowed = huge_mpol_nodes_allowed();
+	if (nid == NO_NODEID_SPECIFIED)
+		nodes_allowed = huge_mpol_nodes_allowed();
+	else {
+		/*
+		 * incoming 'count' is for node 'nid' only, so
+		 * adjust count to global, but restrict alloc/free
+		 * to the specified node.
+		 */
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		nodes_allowed = alloc_nodemask_of_node(nid);
+		if (!nodes_allowed)
+			printk(KERN_WARNING "%s unable to allocate allowed "
+			       "nodes mask for huge page allocation/free.  "
+			       "Falling back to default.\n", current->comm);
+	}
 
 	/*
 	 * Increase the pool size
@@ -1329,51 +1345,71 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
-static struct hstate *kobj_to_hstate(struct kobject *kobj)
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
 {
 	int i;
+
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
+		if (hstate_kobjs[i] == kobj) {
+			if (nidp)
+				*nidp = NO_NODEID_SPECIFIED;
 			return &hstates[i];
-	BUG();
-	return NULL;
+		}
+
+	return kobj_to_node_hstate(kobj, nidp);
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	struct hstate *h;
+	unsigned long nr_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NO_NODEID_SPECIFIED)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
-		struct kobj_attribute *attr, const char *buf, size_t count)
+		struct kobj_attribute *attr, const char *buf, size_t len)
 {
+	unsigned long count;
+	struct hstate *h;
+	int nid;
 	int err;
-	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
 
-	err = strict_strtoul(buf, 10, &input);
+	err = strict_strtoul(buf, 10, &count);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h = kobj_to_hstate(kobj, &nid);
+	h->max_huge_pages = set_max_huge_pages(h, count, nid);
 
-	return count;
+	return len;
 }
 HSTATE_ATTR(nr_hugepages);
 
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
 	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
@@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	struct hstate *h;
+	unsigned long free_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NO_NODEID_SPECIFIED)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
 static ssize_t resv_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 	return sprintf(buf, "%lu\n", h->resv_huge_pages);
 }
 HSTATE_ATTR_RO(resv_hugepages);
@@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages);
 static ssize_t surplus_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	struct hstate *h;
+	unsigned long surplus_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid == NO_NODEID_SPECIFIED)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+
+struct node_hstate {
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
+};
+struct node_hstate node_hstates[MAX_NUMNODES];
+
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node_hstate *nhs = &node_hstates[nid];
+		int i;
+		for (i = 0; i < HUGE_MAX_HSTATE; i++)
+			if (nhs->hstate_kobjs[i] == kobj) {
+				if (nidp)
+					*nidp = nid;
+				return &hstates[i];
+			}
+	}
+
+	BUG();
+	return NULL;
+}
+
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
+
+	if (!nhs->hugepages_kobj)
+		return;
+
+	for_each_hstate(h)
+		if (nhs->hstate_kobjs[h - hstates]) {
+			kobject_put(nhs->hstate_kobjs[h - hstates]);
+			nhs->hstate_kobjs[h - hstates] = NULL;
+		}
+
+	kobject_put(nhs->hugepages_kobj);
+	nhs->hugepages_kobj = NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+
+	register_hugetlbfs_with_node(NULL, NULL);
+}
+
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
+	int err;
+
+	if (nhs->hugepages_kobj)
+		return;		/* already allocated */
+
+	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!nhs->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
+						nhs->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err) {
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+			hugetlb_unregister_node(node);
+			break;
+		}
+	}
+}
+
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid)
+			hugetlb_register_node(node);
+	}
+
+	register_hugetlbfs_with_node(hugetlb_register_node,
+                                     hugetlb_unregister_node);
+}
+#else	/* !CONFIG_NUMA */
+
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	BUG();
+	if (nidp)
+		*nidp = -1;
+	return NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void) { }
+
+static void hugetlb_register_all_nodes(void) { }
+
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp,
+		                                       NO_NODEID_SPECIFIED);
 
 	return 0;
 }
Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/numa.h	2009-08-26 12:37:03.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h	2009-08-26 12:58:54.000000000 -0400
@@ -10,4 +10,6 @@
 
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
+#define NO_NODEID_SPECIFIED	(-1)
+
 #endif /* _LINUX_NUMA_H */
Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-26 12:37:03.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-26 12:40:19.000000000 -0400
@@ -28,6 +28,7 @@ struct node {
 
 struct memory_block;
 extern struct node node_devices[];
+typedef  void (*NODE_REGISTRATION_FUNC)(struct node *);
 
 extern int register_node(struct node *, int, struct node *);
 extern void unregister_node(struct node *node);
@@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						int nid);
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
+                                         NODE_REGISTRATION_FUNC unregister);
 #else
 static inline int register_one_node(int nid)
 {
@@ -65,6 +68,9 @@ static inline int unregister_mem_sect_un
 {
 	return 0;
 }
+
+static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
+                                                NODE_REGISTRATION_FUNC unregister) { }
 #endif
 
 #define to_node(sys_device) container_of(sys_device, struct node, sysdev)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 18:02           ` Lee Schermerhorn
@ 2009-08-26 19:47             ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2009-08-26 19:47 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, linux-numa, akpm, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Wed, 26 Aug 2009, Lee Schermerhorn wrote:

> Against: 2.6.31-rc6-mmotm-090820-1918
> 
> Introduce nodemask macro to allocate a nodemask and 
> initialize it to contain a single node, using existing
> nodemask_of_node() macro.  Coded as a macro to avoid header
> dependency hell.
> 
> This will be used to construct the huge pages "nodes_allowed"
> nodemask for a single node when a persistent huge page
> pool page count is modified via a per node sysfs attribute.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/nodemask.h |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
> @@ -257,6 +257,16 @@ static inline int __next_node(int n, con
>  	m;								\
>  })
>  
> +#define alloc_nodemask_of_node(node)					\
> +({									\
> +	typeof(_unused_nodemask_arg_) *nmp;				\
> +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> +	if (nmp)							\
> +		*nmp = nodemask_of_node(node);				\
> +	nmp;								\
> +})
> +
> +
>  #define first_unset_node(mask) __first_unset_node(&(mask))
>  static inline int __first_unset_node(const nodemask_t *maskp)
>  {

I think it would probably be better to use the generic NODEMASK_ALLOC() 
interface by requiring it to pass the entire type (including "struct") as 
part of the first parameter.  Then it automatically takes care of 
dynamically allocating large nodemasks vs. allocating them on the stack.

Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case 
to be this:

	#define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL);

and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct 
nodemask_scratch, x), and then doing this in your code:

	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
	if (nodes_allowed)
		*nodes_allowed = nodemask_of_node(node);

The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can 
probably be made more general to handle cases like this.
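
For illustration, here is a minimal sketch of what that redefinition could
look like, assuming the small-mask case stays on the stack (the
NODEMASK_FREE() counterpart and the NODES_SHIFT <= 8 variant are
assumptions for this sketch, not quotes from any posted patch):

	#if NODES_SHIFT > 8
	#define NODEMASK_ALLOC(x, m)	x *m = kmalloc(sizeof(*m), GFP_KERNEL)
	#define NODEMASK_FREE(m)	kfree(m)
	#else
	#define NODEMASK_ALLOC(x, m)	x _##m, *m = &_##m
	#define NODEMASK_FREE(m)	do { } while (0)
	#endif

Callers would pair NODEMASK_ALLOC() with NODEMASK_FREE(), and the
kmalloc() either happens or compiles away depending on the mask size.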


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 19:47             ` David Rientjes
@ 2009-08-26 20:46               ` Lee Schermerhorn
  -1 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-26 20:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-mm, linux-numa, akpm, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Wed, 2009-08-26 at 12:47 -0700, David Rientjes wrote:
> On Wed, 26 Aug 2009, Lee Schermerhorn wrote:
> 
> > Against: 2.6.31-rc6-mmotm-090820-1918
> > 
> > Introduce nodemask macro to allocate a nodemask and 
> > initialize it to contain a single node, using existing
> > nodemask_of_node() macro.  Coded as a macro to avoid header
> > dependency hell.
> > 
> > This will be used to construct the huge pages "nodes_allowed"
> > nodemask for a single node when a persistent huge page
> > pool page count is modified via a per node sysfs attribute.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/nodemask.h |   10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
> > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con
> >  	m;								\
> >  })
> >  
> > +#define alloc_nodemask_of_node(node)					\
> > +({									\
> > +	typeof(_unused_nodemask_arg_) *nmp;				\
> > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> > +	if (nmp)							\
> > +		*nmp = nodemask_of_node(node);				\
> > +	nmp;								\
> > +})
> > +
> > +
> >  #define first_unset_node(mask) __first_unset_node(&(mask))
> >  static inline int __first_unset_node(const nodemask_t *maskp)
> >  {
> 
> I think it would probably be better to use the generic NODEMASK_ALLOC() 
> interface by requiring it to pass the entire type (including "struct") as 
> part of the first parameter.  Then it automatically takes care of 
> dynamically allocating large nodemasks vs. allocating them on the stack.
> 
> Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case 
> to be this:
> 
> 	#define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL);
> 
> and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct 
> nodemask_scratch, x), and then doing this in your code:
> 
> 	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> 	if (nodes_allowed)
> 		*nodes_allowed = nodemask_of_node(node);
> 
> The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can 
> probably be made more general to handle cases like this.

I just don't know what that would accomplish.  Heck, I'm not all that
happy with the alloc_nodemask_of_node() because it's allocating both a
hidden nodemask_t and a pointer thereto on the stack just to return a
pointer to a kmalloc()ed nodemask_t--which is what I want/need here.

One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al]
is that it declares the pointer variable as well as initializing it,
perhaps with kmalloc(), ...   Indeed, its purpose is to replace
on-stack nodemask declarations.

So, to use it at the start of, e.g., set_max_huge_pages() where I can
safely use it throughout the function, I'll end up allocating the
nodes_allowed mask on every call, whether or not a node is specified or
there is a non-default mempolicy.   If it turns out that no node was
specified and we have default policy, we need to free the mask and NULL
out nodes_allowed up front so that we get default behavior.  That seems
uglier to me than only allocating the nodemask when we know we need one.
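
To make that concrete, the allocate-up-front version would have to look
something like this sketch (hypothetical flow only;
task_has_nondefault_mempolicy() is a stand-in name, not a real function):

	NODEMASK_ALLOC(nodemask_t, nodes_allowed);	/* kmalloc()s every call */

	if (nid == NO_NODEID_SPECIFIED && !task_has_nondefault_mempolicy()) {
		/* default behavior wants a NULL mask: free it, unused */
		kfree(nodes_allowed);
		nodes_allowed = NULL;
	} else if (nodes_allowed)
		*nodes_allowed = nodemask_of_node(nid);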

I'm not opposed to using a generic function/macro where one exists that
suits my purposes.   I just don't see one.  I tried to create
one--alloc_nodemask_of_node(), and to keep Mel happy, I tried to reuse
nodemask_of_node() to initialize it.  I'm really not happy with the
results--because of those extra, hidden stack variables.  I could
eliminate those by creating an out-of-line function, but there's no good
place to put a generic nodemask function--no nodemask.c.  

I'm leaning towards going back to my original hugetlb-private
"nodes_allowed_from_node()" or such.  I can use nodemask_from_node to
initialize it, if that will make Mel happy, but trying to force fit an
existing "generic" function just because it's generic seems pointless.

So, I'm going to let this series rest until I hear back from you and Mel
on how to proceed with this. 

Lee


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 20:46               ` Lee Schermerhorn
@ 2009-08-27  9:52                 ` Mel Gorman
  -1 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-27  9:52 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: David Rientjes, linux-mm, linux-numa, akpm, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Wed, Aug 26, 2009 at 04:46:43PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-08-26 at 12:47 -0700, David Rientjes wrote:
> > On Wed, 26 Aug 2009, Lee Schermerhorn wrote:
> > 
> > > Against: 2.6.31-rc6-mmotm-090820-1918
> > > 
> > > Introduce nodemask macro to allocate a nodemask and 
> > > initialize it to contain a single node, using existing
> > > nodemask_of_node() macro.  Coded as a macro to avoid header
> > > dependency hell.
> > > 
> > > This will be used to construct the huge pages "nodes_allowed"
> > > nodemask for a single node when a persistent huge page
> > > pool page count is modified via a per node sysfs attribute.
> > > 
> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > 
> > >  include/linux/nodemask.h |   10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > > 
> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-24 10:16:56.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-26 12:38:31.000000000 -0400
> > > @@ -257,6 +257,16 @@ static inline int __next_node(int n, con
> > >  	m;								\
> > >  })
> > >  
> > > +#define alloc_nodemask_of_node(node)					\
> > > +({									\
> > > +	typeof(_unused_nodemask_arg_) *nmp;				\
> > > +	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
> > > +	if (nmp)							\
> > > +		*nmp = nodemask_of_node(node);				\
> > > +	nmp;								\
> > > +})
> > > +
> > > +
> > >  #define first_unset_node(mask) __first_unset_node(&(mask))
> > >  static inline int __first_unset_node(const nodemask_t *maskp)
> > >  {
> > 
> > I think it would probably be better to use the generic NODEMASK_ALLOC() 
> > interface by requiring it to pass the entire type (including "struct") as 
> > part of the first parameter.  Then it automatically takes care of 
> > dynamically allocating large nodemasks vs. allocating them on the stack.
> > 
> > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case 
> > to be this:
> > 
> > 	#define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL);
> > 
> > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct 
> > nodemask_scratch, x), and then doing this in your code:
> > 
> > 	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> > 	if (nodes_allowed)
> > 		*nodes_allowed = nodemask_of_node(node);
> > 
> > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can 
> > probably be made more general to handle cases like this.
> 
> I just don't know what that would accomplish.  Heck, I'm not all that
> happy with the alloc_nodemask_of_node() because it's allocating both a
> hidden nodemask_t and a pointer thereto on the stack just to return a
> pointer to a kmalloc()ed nodemask_t--which is what I want/need here.
> 
> One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al]
> is that it declares the pointer variable as well as initializing it,
> perhaps with kmalloc(), ...   Indeed, its purpose is to replace
> on-stack nodemask declarations.
> 
> So, to use it at the start of, e.g., set_max_huge_pages() where I can
> safely use it throughout the function, I'll end up allocating the
> nodes_allowed mask on every call, whether or not a node is specified or
> there is a non-default mempolicy.   If it turns out that no node was
> specified and we have default policy, we need to free the mask and NULL
> out nodes_allowed up front so that we get default behavior.  That seems
> uglier to me than only allocating the nodemask when we know we need one.
> 
> I'm not opposed to using a generic function/macro where one exists that
> suits my purposes.   I just don't see one.  I tried to create
> one--alloc_nodemask_of_node(), and to keep Mel happy, I tried to reuse
> nodemask_of_node() to initialize it.  I'm really not happy with the
> results--because of those extra, hidden stack variables.  I could
> eliminate those by creating an out-of-line function, but there's no good
> place to put a generic nodemask function--no nodemask.c.  
> 

Ok. When I brought the subject up, it looked like you were creating a
hugetlbfs-specific helper that should have been generic. While that is
still the case, it's looking like generic helpers make things worse and
hide side-effects in helper functions that might cause greater difficulty
in the future. I'm happier to go with the existing code than I was before,
so consider my objection dropped.

> I'm leaning towards going back to my original hugetlb-private
> "nodes_allowed_from_node()" or such.  I can use nodemask_from_node to
> initialize it, if that will make Mel happy, but trying to force fit an
> existing "generic" function just because it's generic seems pointless.
> 
> So, I'm going to let this series rest until I hear back from you and Mel
> on how to proceed with this. 
> 

I hate to do it to you, but at this point, I'm leaning towards your current
approach.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 18:04         ` Lee Schermerhorn
@ 2009-08-27 10:23           ` Mel Gorman
  2009-08-27 16:52             ` Lee Schermerhorn
  0 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2009-08-27 10:23 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Wed, Aug 26, 2009 at 02:04:03PM -0400, Lee Schermerhorn wrote:
> <SNIP>
> This revised patch also removes the include of hugetlb.h from node.h.
> 
> Lee
> 
> ---
> 
> PATCH 5/6 hugetlb:  register per node hugepages attributes
> 
> Against: 2.6.31-rc6-mmotm-090820-1918
> 
> V2:  remove dependency on kobject private bitfield.  Search
>      global hstates then all per node hstates for kobject
>      match in attribute show/store functions.
> 
> V3:  rebase atop the mempolicy-based hugepage alloc/free;
>      use custom "nodes_allowed" to restrict alloc/free to
>      a specific node via per node attributes.  Per node
>      attribute overrides mempolicy.  I.e., mempolicy only
>      applies to global attributes.
> 
> V4:  Fix issues raised by Mel Gorman:
>      + add !NUMA versions of hugetlb_[un]register_node()
>      + rename 'hi' to 'i' in kobj_to_node_hstate()
>      + rename (count, input) to (len, count) in nr_hugepages_store()
>      + moved per node hugepages_kobj and hstate_kobjs[] from the
>        struct node [sysdev] to hugetlb.c private arrays.
>      + changed registration mechanism so that hugetlbfs [a module]
>        registers its attribute registration callbacks with the node
>        driver, eliminating the dependency between the node driver
>        and hugetlbfs.  From its init func, hugetlbfs will register
>        all on-line nodes' hugepage sysfs attributes along with
>        hugetlbfs' attributes register/unregister functions.  The
>        node driver will use these functions to [un]register nodes
>        with hugetlbfs on node hot-plug.
>      + replaced hugetlb.c private "nodes_allowed_from_node()" with
>        generic "alloc_nodemask_of_node()".
> 
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
> 
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> 	nr_hugepages       - r/w
> 	free_huge_pages    - r/o
> 	surplus_huge_pages - r/o
> 
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling, and the
> "nodes_allowed" constraint processing as possible.
> Calling set_max_huge_pages() with no node indicates a change to
> global hstate parameters.  In this case, any non-default task
> mempolicy will be used to generate the nodes_allowed mask.  A
> valid node id indicates an update to that node's hstate 
> parameters, and the count argument specifies the target count
> for the specified node.  From this info, we compute the target
> global count for the hstate and construct a nodes_allowed node
> mask containing only the specified node.
> 
> Setting the node specific nr_hugepages via the per node attribute
> effectively ignores any task mempolicy or cpuset constraints.
> 
> With this patch:
> 
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
> 
> Starting from:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     0
> Node 2 HugePages_Free:      0
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 0
> 
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
> 
> [Note that this is equivalent to:
> 	numactl -m 2 hugeadm --pool-pages-min 2M:+16
> ]
> 
> Yields:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:    16
> Node 2 HugePages_Free:     16
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 16
> 
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     8
> Node 2 HugePages_Free:      8
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  drivers/base/node.c  |   27 +++++
>  include/linux/node.h |    6 +
>  include/linux/numa.h |    2 
>  mm/hugetlb.c         |  245 ++++++++++++++++++++++++++++++++++++++++++++-------
>  4 files changed, 250 insertions(+), 30 deletions(-)
> 
> Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c	2009-08-26 12:37:03.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c	2009-08-26 13:01:54.000000000 -0400
> @@ -177,6 +177,31 @@ static ssize_t node_read_distance(struct
>  }
>  static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
>  
> +/*
> + * hugetlbfs per node attributes registration interface
> + */
> +NODE_REGISTRATION_FUNC __hugetlb_register_node;
> +NODE_REGISTRATION_FUNC __hugetlb_unregister_node;
> +
> +static inline void hugetlb_register_node(struct node *node)
> +{
> +	if (__hugetlb_register_node)
> +		__hugetlb_register_node(node);
> +}
> +
> +static inline void hugetlb_unregister_node(struct node *node)
> +{
> +	if (__hugetlb_unregister_node)
> +		__hugetlb_unregister_node(node);
> +}
> +
> +void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
> +                                  NODE_REGISTRATION_FUNC unregister)
> +{
> +	__hugetlb_register_node   = doregister;
> +	__hugetlb_unregister_node = unregister;
> +}
> +
>  

I think I get this. Basically, you want to avoid the functions being
called too early, before sysfs is initialised, while still working with
hotplug later. So early in boot, no registration happens. sysfs and
hugetlbfs get initialised, and at that point these hooks become active,
all nodes are registered, and hotplug later continues to work.

Is that accurate? Can it get a comment?
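
A comment along those lines might read (wording assumed, not from any
posted version):

	/*
	 * These hooks start out NULL, so nodes registered before
	 * hugetlbfs initialises are skipped here.  hugetlb_init() runs
	 * after sysfs is up, registers the hugepages attributes for all
	 * on-line nodes and installs these callbacks; from then on node
	 * hotplug [un]registers the per node attributes through them.
	 */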

>  /*
>   * register_node - Setup a sysfs device for a node.
> @@ -200,6 +225,7 @@ int register_node(struct node *node, int
>  		sysdev_create_file(&node->sysdev, &attr_distance);
>  
>  		scan_unevictable_register_node(node);
> +		hugetlb_register_node(node);
>  	}
>  	return error;
>  }
> @@ -220,6 +246,7 @@ void unregister_node(struct node *node)
>  	sysdev_remove_file(&node->sysdev, &attr_distance);
>  
>  	scan_unevictable_unregister_node(node);
> +	hugetlb_unregister_node(node);
>  
>  	sysdev_unregister(&node->sysdev);
>  }
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-26 12:37:04.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-26 13:01:54.000000000 -0400
> @@ -24,6 +24,7 @@
>  #include <asm/io.h>
>  
>  #include <linux/hugetlb.h>
> +#include <linux/node.h>
>  #include "internal.h"
>  
>  const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1245,7 +1246,8 @@ static int adjust_pool_surplus(struct hs
>  }
>  
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> +								int nid)
>  {
>  	unsigned long min_count, ret;
>  	nodemask_t *nodes_allowed;
> @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
>  	if (h->order >= MAX_ORDER)
>  		return h->max_huge_pages;
>  
> -	nodes_allowed = huge_mpol_nodes_allowed();
> +	if (nid == NO_NODEID_SPECIFIED)
> +		nodes_allowed = huge_mpol_nodes_allowed();
> +	else {
> +		/*
> +		 * incoming 'count' is for node 'nid' only, so
> +		 * adjust count to global, but restrict alloc/free
> +		 * to the specified node.
> +		 */
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		nodes_allowed = alloc_nodemask_of_node(nid);

alloc_nodemask_of_node() isn't defined anywhere.
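
(As an aside on the count adjustment in the hunk above: it converts the
per node target into the global target that set_max_huge_pages() works
with.  For example, with 10 global huge pages of which 4 sit on node 2,
writing 16 to node 2's nr_hugepages gives count = 16 + (10 - 4) = 22,
with the subsequent alloc/free restricted to node 2 by nodes_allowed.)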

> +		if (!nodes_allowed)
> +			printk(KERN_WARNING "%s unable to allocate allowed "
> +			       "nodes mask for huge page allocation/free.  "
> +			       "Falling back to default.\n", current->comm);
> +	}
>  
>  	/*
>  	 * Increase the pool size
> @@ -1329,51 +1345,71 @@ out:
>  static struct kobject *hugepages_kobj;
>  static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
>  
> -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp);
> +
> +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
>  {
>  	int i;
> +
>  	for (i = 0; i < HUGE_MAX_HSTATE; i++)
> -		if (hstate_kobjs[i] == kobj)
> +		if (hstate_kobjs[i] == kobj) {
> +			if (nidp)
> +				*nidp = NO_NODEID_SPECIFIED;
>  			return &hstates[i];
> -	BUG();
> -	return NULL;
> +		}
> +
> +	return kobj_to_node_hstate(kobj, nidp);
>  }
>  
>  static ssize_t nr_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->nr_huge_pages);
> +	struct hstate *h;
> +	unsigned long nr_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		nr_huge_pages = h->nr_huge_pages;
> +	else
> +		nr_huge_pages = h->nr_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", nr_huge_pages);
>  }
> +
>  static ssize_t nr_hugepages_store(struct kobject *kobj,
> -		struct kobj_attribute *attr, const char *buf, size_t count)
> +		struct kobj_attribute *attr, const char *buf, size_t len)
>  {
> +	unsigned long count;
> +	struct hstate *h;
> +	int nid;
>  	int err;
> -	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
>  
> -	err = strict_strtoul(buf, 10, &input);
> +	err = strict_strtoul(buf, 10, &count);
>  	if (err)
>  		return 0;
>  
> -	h->max_huge_pages = set_max_huge_pages(h, input);
> +	h = kobj_to_hstate(kobj, &nid);
> +	h->max_huge_pages = set_max_huge_pages(h, count, nid);
>  
> -	return count;
> +	return len;
>  }
>  HSTATE_ATTR(nr_hugepages);
>  
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
>  	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
>  }
> +
>  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
>  		struct kobj_attribute *attr, const char *buf, size_t count)
>  {
>  	int err;
>  	unsigned long input;
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  
>  	err = strict_strtoul(buf, 10, &input);
>  	if (err)
> @@ -1390,15 +1426,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
>  static ssize_t free_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->free_huge_pages);
> +	struct hstate *h;
> +	unsigned long free_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		free_huge_pages = h->free_huge_pages;
> +	else
> +		free_huge_pages = h->free_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", free_huge_pages);
>  }
>  HSTATE_ATTR_RO(free_hugepages);
>  
>  static ssize_t resv_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> +	struct hstate *h = kobj_to_hstate(kobj, NULL);
>  	return sprintf(buf, "%lu\n", h->resv_huge_pages);
>  }
>  HSTATE_ATTR_RO(resv_hugepages);
> @@ -1406,8 +1451,17 @@ HSTATE_ATTR_RO(resv_hugepages);
>  static ssize_t surplus_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
> -	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> +	struct hstate *h;
> +	unsigned long surplus_huge_pages;
> +	int nid;
> +
> +	h = kobj_to_hstate(kobj, &nid);
> +	if (nid == NO_NODEID_SPECIFIED)
> +		surplus_huge_pages = h->surplus_huge_pages;
> +	else
> +		surplus_huge_pages = h->surplus_huge_pages_node[nid];
> +
> +	return sprintf(buf, "%lu\n", surplus_huge_pages);
>  }
>  HSTATE_ATTR_RO(surplus_hugepages);
>  
> @@ -1424,19 +1478,21 @@ static struct attribute_group hstate_att
>  	.attrs = hstate_attrs,
>  };
>  
> -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> +				struct kobject *parent,
> +				struct kobject **hstate_kobjs,
> +				struct attribute_group *hstate_attr_group)
>  {
>  	int retval;
> +	int hi = h - hstates;
>  
> -	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> -							hugepages_kobj);
> -	if (!hstate_kobjs[h - hstates])
> +	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> +	if (!hstate_kobjs[hi])
>  		return -ENOMEM;
>  
> -	retval = sysfs_create_group(hstate_kobjs[h - hstates],
> -							&hstate_attr_group);
> +	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
>  	if (retval)
> -		kobject_put(hstate_kobjs[h - hstates]);
> +		kobject_put(hstate_kobjs[hi]);
>  
>  	return retval;
>  }
> @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo
>  		return;
>  
>  	for_each_hstate(h) {
> -		err = hugetlb_sysfs_add_hstate(h);
> +		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> +					 hstate_kobjs, &hstate_attr_group);
>  		if (err)
>  			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
>  								h->name);
>  	}
>  }
>  
> +#ifdef CONFIG_NUMA
> +
> +struct node_hstate {
> +	struct kobject		*hugepages_kobj;
> +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> +};
> +struct node_hstate node_hstates[MAX_NUMNODES];
> +
> +static struct attribute *per_node_hstate_attrs[] = {
> +	&nr_hugepages_attr.attr,
> +	&free_hugepages_attr.attr,
> +	&surplus_hugepages_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group per_node_hstate_attr_group = {
> +	.attrs = per_node_hstate_attrs,
> +};
> +
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node_hstate *nhs = &node_hstates[nid];
> +		int i;
> +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> +			if (nhs->hstate_kobjs[i] == kobj) {
> +				if (nidp)
> +					*nidp = nid;
> +				return &hstates[i];
> +			}
> +	}
> +
> +	BUG();
> +	return NULL;
> +}

Ok, this looks nicer in that the dependencies between hugetlbfs and base
node support are going the right direction.

> +
> +void hugetlb_unregister_node(struct node *node)
> +{
> +	struct hstate *h;
> +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> +
> +	if (!nhs->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h)
> +		if (nhs->hstate_kobjs[h - hstates]) {
> +			kobject_put(nhs->hstate_kobjs[h - hstates]);
> +			nhs->hstate_kobjs[h - hstates] = NULL;
> +		}
> +
> +	kobject_put(nhs->hugepages_kobj);
> +	nhs->hugepages_kobj = NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++)
> +		hugetlb_unregister_node(&node_devices[nid]);
> +
> +	register_hugetlbfs_with_node(NULL, NULL);
> +}
> +
> +void hugetlb_register_node(struct node *node)
> +{
> +	struct hstate *h;
> +	struct node_hstate *nhs = &node_hstates[node->sysdev.id];
> +	int err;
> +
> +	if (nhs->hugepages_kobj)
> +		return;		/* already allocated */
> +
> +	nhs->hugepages_kobj = kobject_create_and_add("hugepages",
> +							&node->sysdev.kobj);
> +	if (!nhs->hugepages_kobj)
> +		return;
> +
> +	for_each_hstate(h) {
> +		err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj,
> +						nhs->hstate_kobjs,
> +						&per_node_hstate_attr_group);
> +		if (err) {
> +			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> +					" for node %d\n",
> +						h->name, node->sysdev.id);
> +			hugetlb_unregister_node(node);
> +			break;
> +		}
> +	}
> +}
> +
> +static void hugetlb_register_all_nodes(void)
> +{
> +	int nid;
> +
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		struct node *node = &node_devices[nid];
> +		if (node->sysdev.id == nid)
> +			hugetlb_register_node(node);
> +	}
> +
> +	register_hugetlbfs_with_node(hugetlb_register_node,
> +                                     hugetlb_unregister_node);
> +}
> +#else	/* !CONFIG_NUMA */
> +
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> +	BUG();
> +	if (nidp)
> +		*nidp = -1;
> +	return NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void) { }
> +
> +static void hugetlb_register_all_nodes(void) { }
> +
> +#endif
> +
>  static void __exit hugetlb_exit(void)
>  {
>  	struct hstate *h;
>  
> +	hugetlb_unregister_all_nodes();
> +
>  	for_each_hstate(h) {
>  		kobject_put(hstate_kobjs[h - hstates]);
>  	}
> @@ -1496,6 +1678,8 @@ static int __init hugetlb_init(void)
>  
>  	hugetlb_sysfs_init();
>  
> +	hugetlb_register_all_nodes();
> +
>  	return 0;
>  }
>  module_init(hugetlb_init);
> @@ -1598,7 +1782,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
>  
>  	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp);
> +		h->max_huge_pages = set_max_huge_pages(h, tmp,
> +		                                       NO_NODEID_SPECIFIED);
>  
>  	return 0;
>  }
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/numa.h	2009-08-26 12:37:03.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/numa.h	2009-08-26 12:58:54.000000000 -0400
> @@ -10,4 +10,6 @@
>  
>  #define MAX_NUMNODES    (1 << NODES_SHIFT)
>  
> +#define NO_NODEID_SPECIFIED	(-1)
> +
>  #endif /* _LINUX_NUMA_H */
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-26 12:37:03.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-26 12:40:19.000000000 -0400
> @@ -28,6 +28,7 @@ struct node {
>  
>  struct memory_block;
>  extern struct node node_devices[];
> +typedef  void (*NODE_REGISTRATION_FUNC)(struct node *);
>  
>  extern int register_node(struct node *, int, struct node *);
>  extern void unregister_node(struct node *node);
> @@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns
>  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  						int nid);
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
> +extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
> +                                         NODE_REGISTRATION_FUNC unregister);
>  #else
>  static inline int register_one_node(int nid)
>  {
> @@ -65,6 +68,9 @@ static inline int unregister_mem_sect_un
>  {
>  	return 0;
>  }
> +
> +static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC do,
> +                                                NODE_REGISTRATION_FUNC un) { }

"do" is a keyword. This won't compile on !NUMA. needs to be called
doregister and unregister or basically anything other than "do"
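
That is, the !NUMA stub needs something like this sketch (parameter names
assumed):

	static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC reg,
						NODE_REGISTRATION_FUNC unreg) { }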

>  #endif
>  
>  #define to_node(sys_device) container_of(sys_device, struct node, sysdev)
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-27 10:23           ` Mel Gorman
@ 2009-08-27 16:52             ` Lee Schermerhorn
  2009-08-28 10:09                 ` Mel Gorman
  0 siblings, 1 reply; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-27 16:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-08-27 at 11:23 +0100, Mel Gorman wrote:
> On Wed, Aug 26, 2009 at 02:04:03PM -0400, Lee Schermerhorn wrote:
> > <SNIP>
> > This revised patch also removes the include of hugetlb.h from node.h.
> > 
> > Lee
> > 
> > ---
> > 
> > PATCH 5/6 hugetlb:  register per node hugepages attributes
> > 
> > Against: 2.6.31-rc6-mmotm-090820-1918
> > 
> > V2:  remove dependency on kobject private bitfield.  Search
> >      global hstates then all per node hstates for kobject
> >      match in attribute show/store functions.
> > 
> > V3:  rebase atop the mempolicy-based hugepage alloc/free;
> >      use custom "nodes_allowed" to restrict alloc/free to
> >      a specific node via per node attributes.  Per node
> >      attribute overrides mempolicy.  I.e., mempolicy only
> >      applies to global attributes.
> > 
> > V4:  Fix issues raised by Mel Gorman:
> >      + add !NUMA versions of hugetlb_[un]register_node()
> >      + rename 'hi' to 'i' in kobj_to_node_hstate()
> >      + rename (count, input) to (len, count) in nr_hugepages_store()
> >      + moved per node hugepages_kobj and hstate_kobjs[] from the
> >        struct node [sysdev] to hugetlb.c private arrays.
> >      + changed registration mechanism so that hugetlbfs [a module]
> >        registers its attribute registration callbacks with the node
> >        driver, eliminating the dependency between the node driver
> >        and hugetlbfs.  From its init func, hugetlbfs will register
> >        all on-line nodes' hugepage sysfs attributes along with
> >        hugetlbfs' attributes register/unregister functions.  The
> >        node driver will use these functions to [un]register nodes
> >        with hugetlbfs on node hot-plug.
> >      + replaced hugetlb.c private "nodes_allowed_from_node()" with
> >        generic "alloc_nodemask_of_node()".
> > 
> > This patch adds the per huge page size control/query attributes
> > to the per node sysdevs:
> > 
> > /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> > 	nr_hugepages       - r/w
> > 	free_huge_pages    - r/o
> > 	surplus_huge_pages - r/o
> > 
> > The patch attempts to re-use/share as much of the existing
> > global hstate attribute initialization and handling, and the
> > "nodes_allowed" constraint processing as possible.
> > Calling set_max_huge_pages() with no node indicates a change to
> > global hstate parameters.  In this case, any non-default task
> > mempolicy will be used to generate the nodes_allowed mask.  A
> > valid node id indicates an update to that node's hstate 
> > parameters, and the count argument specifies the target count
> > for the specified node.  From this info, we compute the target
> > global count for the hstate and construct a nodes_allowed node
> > mask containing only the specified node.
> > 
> > Setting the node specific nr_hugepages via the per node attribute
> > effectively ignores any task mempolicy or cpuset constraints.
> > 
<snip>
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c	2009-08-26 12:37:03.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c	2009-08-26 13:01:54.000000000 -0400
> > @@ -177,6 +177,31 @@ static ssize_t node_read_distance(struct
> >  }
> >  static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
> >  
> > +/*
> > + * hugetlbfs per node attributes registration interface
> > + */
> > +NODE_REGISTRATION_FUNC __hugetlb_register_node;
> > +NODE_REGISTRATION_FUNC __hugetlb_unregister_node;
> > +
> > +static inline void hugetlb_register_node(struct node *node)
> > +{
> > +	if (__hugetlb_register_node)
> > +		__hugetlb_register_node(node);
> > +}
> > +
> > +static inline void hugetlb_unregister_node(struct node *node)
> > +{
> > +	if (__hugetlb_unregister_node)
> > +		__hugetlb_unregister_node(node);
> > +}
> > +
> > +void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
> > +                                  NODE_REGISTRATION_FUNC unregister)
> > +{
> > +	__hugetlb_register_node   = doregister;
> > +	__hugetlb_unregister_node = unregister;
> > +}
> > +
> >  
> 
> I think I get this. Basically, you want to avoid the functions being
> called too early, before sysfs is initialised, while still working with
> hotplug later. So early in boot, no registration happens. sysfs and
> hugetlbfs get initialised, and at that point these hooks become active,
> all nodes are registered, and hotplug later continues to work.
> 
> Is that accurate? Can it get a comment?

Yes, you got it, and yes, I'll add a comment.  I had explained it in the
patch description [V4], but that's not too useful to someone coming
along later...

<snip>

> > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
> >  	if (h->order >= MAX_ORDER)
> >  		return h->max_huge_pages;
> >  
> > -	nodes_allowed = huge_mpol_nodes_allowed();
> > +	if (nid == NO_NODEID_SPECIFIED)
> > +		nodes_allowed = huge_mpol_nodes_allowed();
> > +	else {
> > +		/*
> > +		 * incoming 'count' is for node 'nid' only, so
> > +		 * adjust count to global, but restrict alloc/free
> > +		 * to the specified node.
> > +		 */
> > +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> > +		nodes_allowed = alloc_nodemask_of_node(nid);
> 
> alloc_nodemask_of_node() isn't defined anywhere.


Well, that's because the patch that defines it is in a message that I
meant to send before this one.  I see it's in my Drafts folder.  I'll
attach that patch below.  I'm rebasing against the 0827 mmotm, and I'll
resend the rebased series.  However, I wanted to get your opinion of the
nodemask patch below.

<snip>
> >  
> > +#ifdef CONFIG_NUMA
> > +
> > +struct node_hstate {
> > +	struct kobject		*hugepages_kobj;
> > +	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
> > +};
> > +struct node_hstate node_hstates[MAX_NUMNODES];
> > +
> > +static struct attribute *per_node_hstate_attrs[] = {
> > +	&nr_hugepages_attr.attr,
> > +	&free_hugepages_attr.attr,
> > +	&surplus_hugepages_attr.attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group per_node_hstate_attr_group = {
> > +	.attrs = per_node_hstate_attrs,
> > +};
> > +
> > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > +{
> > +	int nid;
> > +
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		struct node_hstate *nhs = &node_hstates[nid];
> > +		int i;
> > +		for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > +			if (nhs->hstate_kobjs[i] == kobj) {
> > +				if (nidp)
> > +					*nidp = nid;
> > +				return &hstates[i];
> > +			}
> > +	}
> > +
> > +	BUG();
> > +	return NULL;
> > +}
> 
> Ok, this looks nicer in that the dependencies between hugetlbfs and base
> node support are going the right direction.

Agreed.  I removed that "issue" from the patch description.

<snip>
> > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > ===================================================================
> > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h	2009-08-26 12:37:03.000000000 -0400
> > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h	2009-08-26 12:40:19.000000000 -0400
> > @@ -28,6 +28,7 @@ struct node {
> >  
> >  struct memory_block;
> >  extern struct node node_devices[];
> > +typedef  void (*NODE_REGISTRATION_FUNC)(struct node *);
> >  
> >  extern int register_node(struct node *, int, struct node *);
> >  extern void unregister_node(struct node *node);
> > @@ -39,6 +40,8 @@ extern int unregister_cpu_under_node(uns
> >  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
> >  						int nid);
> >  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
> > +extern void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC doregister,
> > +                                         NODE_REGISTRATION_FUNC unregister);
> >  #else
> >  static inline int register_one_node(int nid)
> >  {
> > @@ -65,6 +68,9 @@ static inline int unregister_mem_sect_un
> >  {
> >  	return 0;
> >  }
> > +
> > +static inline void register_hugetlbfs_with_node(NODE_REGISTRATION_FUNC do,
> > +                                                NODE_REGISTRATION_FUNC un) { }
> 
> "do" is a keyword. This won't compile on !NUMA. needs to be called
> doregister and unregister or basically anything other than "do"

Sorry.  A last-minute, obviously untested addition.  I have built the
reworked code with and without NUMA.

Here's my current "alloc_nodemask_of_node()" patch.  What do you think
about going with this? 

PATCH 4/6 - hugetlb:  introduce alloc_nodemask_of_node()

Against: 2.6.31-rc6-mmotm-090820-1918

Introduce a nodemask macro to allocate a nodemask and
initialize it to contain a single node, factoring the
initialization out of the existing nodemask_of_node() macro.
Coded as a macro to avoid header dependency hell.

This will be used to construct the huge pages "nodes_allowed"
nodemask for a single node when a persistent huge page
pool page count is modified via a per node sysfs attribute.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/nodemask.h |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h
===================================================================
--- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/nodemask.h	2009-08-27 09:16:39.000000000 -0400
+++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/nodemask.h	2009-08-27 09:52:21.000000000 -0400
@@ -245,18 +245,31 @@ static inline int __next_node(int n, con
 	return min_t(int,MAX_NUMNODES,find_next_bit(srcp->bits, MAX_NUMNODES, n+1));
 }
 
+#define init_nodemask_of_nodes(mask, node)				\
+	do { nodes_clear(*(mask));					\
+	     node_set((node), *(mask)); } while (0)
+
 #define nodemask_of_node(node)						\
 ({									\
 	typeof(_unused_nodemask_arg_) m;				\
 	if (sizeof(m) == sizeof(unsigned long)) {			\
 		m.bits[0] = 1UL<<(node);				\
 	} else {							\
-		nodes_clear(m);						\
-		node_set((node), m);					\
+		init_nodemask_of_nodes(&m, (node));			\
 	}								\
 	m;								\
 })
 
+#define alloc_nodemask_of_node(node)					\
+({									\
+	typeof(_unused_nodemask_arg_) *nmp;				\
+	nmp = kmalloc(sizeof(*nmp), GFP_KERNEL);			\
+	if (nmp)							\
+		init_nodemask_of_nodes(nmp, (node));			\
+	nmp;								\
+})
+
+
 #define first_unset_node(mask) __first_unset_node(&(mask))
 static inline int __first_unset_node(const nodemask_t *maskp)
 {

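
For illustration, the per node attribute path would use it roughly
like this (a sketch only; locking and the surrounding pool adjustment
are elided):

	nodemask_t *nodes_allowed = alloc_nodemask_of_node(nid);

	if (nodes_allowed) {
		/* alloc/free will be constrained to the single node 'nid' */
		/* ... adjust the persistent pool here ... */
		kfree(nodes_allowed);	/* plain kfree(); no free_* macro */
	}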


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-26 20:46               ` Lee Schermerhorn
@ 2009-08-27 19:35               ` David Rientjes
  2009-08-28 12:56                 ` Lee Schermerhorn
  -1 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2009-08-27 19:35 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, linux-numa, Andrew Morton,
	Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney

On Wed, 26 Aug 2009, Lee Schermerhorn wrote:

> > I think it would probably be better to use the generic NODEMASK_ALLOC() 
> > interface by requiring it to pass the entire type (including "struct") as 
> > part of the first parameter.  Then it automatically takes care of 
> > dynamically allocating large nodemasks vs. allocating them on the stack.
> > 
> > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case 
> > to be this:
> > 
> > 	#define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL);
> > 
> > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct 
> > nodemask_scratch, x), and then doing this in your code:
> > 
> > 	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> > 	if (nodes_allowed)
> > 		*nodes_allowed = nodemask_of_node(node);
> > 
> > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can 
> > probably be made more general to handle cases like this.
> 
> I just don't know what that would accomplish.  Heck, I'm not all that
> happy with the alloc_nodemask_of_node() because it's allocating both a
> hidden nodemask_t and a pointer thereto on the stack just to return a
> pointer to a kmalloc()ed nodemask_t--which is what I want/need here.
> 
> One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al]
> is that it declares the pointer variable as well as initializing it,
> perhaps with kmalloc(), ...   Indeed, its purpose is to replace
> on-stack nodemask declarations.
> 

Right, which is why I suggest we only have one such interface to 
dynamically allocate nodemasks when NODES_SHIFT > 8.  That's what defines 
NODEMASK_ALLOC() as being special: it's taking NODES_SHIFT into 
consideration just like CPUMASK_ALLOC() would take NR_CPUS into 
consideration.  Your use case is the intended purpose of NODEMASK_ALLOC() 
and I see no reason why your code can't use the same interface with some 
modification, and it's in the best interest of maintainability not to 
duplicate specialized cases where pre-existing interfaces can be used (or 
improved, in this case).

> So, to use it at the start of, e.g., set_max_huge_pages() where I can
> safely use it throughout the function, I'll end up allocating the
> nodes_allowed mask on every call, whether or not a node is specified or
> there is a non-default mempolicy.  If it turns out that no node was
> specified and we have default policy, we need to free the mask and NULL
> out nodes_allowed up front so that we get default behavior.  That seems
> uglier to me than only allocating the nodemask when we know we need one.
> 

Not with my suggested code of disabling local irqs, getting a reference to 
the mempolicy so it can't be freed, reenabling, and then only using 
NODEMASK_ALLOC() in the switch statement on mpol->mode for MPOL_PREFERRED.
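
That is, the reference would be taken along these lines (sketch,
repeating the approach from my earlier mail):

	struct mempolicy *pol;

	local_irq_disable();		/* stabilize current->mempolicy */
	pol = current->mempolicy;
	mpol_get(pol);			/* take a reference so it can't be freed */
	local_irq_enable();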

> I'm not opposed to using a generic function/macro where one exists that
> suits my purposes.   I just don't see one.  I tried to create
> one--alloc_nodemask_of_node(), and to keep Mel happy, I tried to reuse
> nodemask_of_node() to initialize it.  I'm really not happy with the
> results--because of those extra, hidden stack variables.  I could
> eliminate those by creating an out-of-line function, but there's no good
> place to put a generic nodemask function--no nodemask.c.  
> 

Using NODEMASK_ALLOC(nodes_allowed) wouldn't really be a hidden stack 
variable, would it?  I think most developers would assume that it is 
some automatic variable called `nodes_allowed' since it's later referenced 
(and only needs to be in the case of MPOL_PREFERRED if my mpol_get() 
solution with disabled local irqs is used).
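
For reference, with the redefinition suggested above,
NODEMASK_ALLOC(nodemask_t, nodes_allowed) expands to roughly:

	nodemask_t *nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);

so the pointer itself is an ordinary local variable; only the mask
storage moves off the stack.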


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] hugetlb:  derive huge pages nodes allowed from task mempolicy
  2009-08-25 20:49       ` Lee Schermerhorn
@ 2009-08-27 19:40         ` David Rientjes
  -1 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2009-08-27 19:40 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 25 Aug 2009, Lee Schermerhorn wrote:

> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c	2009-08-24 12:12:50.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c	2009-08-24 12:12:53.000000000 -0400
> > > @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
> > >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > >  {
> > >  	unsigned long min_count, ret;
> > > +	nodemask_t *nodes_allowed;
> > >  
> > >  	if (h->order >= MAX_ORDER)
> > >  		return h->max_huge_pages;
> > >  
> > 
> > Why can't you simply do this?
> > 
> > 	struct mempolicy *pol = NULL;
> > 	nodemask_t *nodes_allowed = &node_online_map;
> > 
> > 	local_irq_disable();
> > 	pol = current->mempolicy;
> > 	mpol_get(pol);
> > 	local_irq_enable();
> > 	if (pol) {
> > 		switch (pol->mode) {
> > 		case MPOL_BIND:
> > 		case MPOL_INTERLEAVE:
> > 			nodes_allowed = pol->v.nodes;
> > 			break;
> > 		case MPOL_PREFERRED:
> > 			... use NODEMASK_SCRATCH() ...
> > 		default:
> > 			BUG();
> > 		}
> > 	}
> > 	mpol_put(pol);
> > 
> > and then use nodes_allowed throughout set_max_huge_pages()?
> 
> 
> Well, I do use nodes_allowed [pointer] throughout set_max_huge_pages().

Yeah, the above code would all be in set_max_huge_pages() and 
huge_mpol_nodes_allowed() would be removed.

> NODEMASK_SCRATCH() didn't exist when I wrote this, and I can't be sure
> it will return a kmalloc()'d nodemask, which I need because a NULL
> nodemask pointer means "all online nodes" [really all nodes with memory,
> I suppose] and I need a pointer to a kmalloc()'d nodemask to return from
> huge_mpol_nodes_allowed().  I want to keep the access to the internals
> of mempolicy in mempolicy.[ch], thus the call out to
> huge_mpol_nodes_allowed(), instead of open coding it.

Ok, so you could add a mempolicy.c helper function that returns a
nodemask_t *: for most cases it points at mpol->v.nodes after taking a
reference on mpol with mpol_get(); for MPOL_PREFERRED it returns a
nodemask dynamically allocated with NODEMASK_ALLOC().

This works nicely because either way you still have a reference to mpol, 
so you'll need to call into a mpol_nodemask_free() function which can use 
the same switch statement:

	void mpol_nodemask_free(struct mempolicy *mpol,
				nodemask_t *nodes_allowed)
	{
		switch (mpol->mode) {
		case MPOL_PREFERRED:
			kfree(nodes_allowed);
			break;
		default:
			break;
		}
		mpol_put(mpol);
	}
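
A caller would then pair the two along these lines (sketch; the helper
name mpol_nodes_allowed() is illustrative, and default-policy handling
is elided):

	struct mempolicy *mpol;
	nodemask_t *nodes_allowed;

	nodes_allowed = mpol_nodes_allowed(&mpol);
	/* ... use nodes_allowed for the pool adjustment ... */
	mpol_nodemask_free(mpol, nodes_allowed);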


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-27 16:52             ` Lee Schermerhorn
@ 2009-08-28 10:09                 ` Mel Gorman
  0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2009-08-28 10:09 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Thu, Aug 27, 2009 at 12:52:10PM -0400, Lee Schermerhorn wrote:
> <snip>
> 
> > > @@ -1253,7 +1255,21 @@ static unsigned long set_max_huge_pages(
> > >  	if (h->order >= MAX_ORDER)
> > >  		return h->max_huge_pages;
> > >  
> > > -	nodes_allowed = huge_mpol_nodes_allowed();
> > > +	if (nid == NO_NODEID_SPECIFIED)
> > > +		nodes_allowed = huge_mpol_nodes_allowed();
> > > +	else {
> > > +		/*
> > > +		 * incoming 'count' is for node 'nid' only, so
> > > +		 * adjust count to global, but restrict alloc/free
> > > +		 * to the specified node.
> > > +		 */
> > > +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> > > +		nodes_allowed = alloc_nodemask_of_node(nid);
> > 
> > alloc_nodemask_of_node() isn't defined anywhere.
> 
> 
> Well, that's because the patch that defines it is in a message that I
> meant to send before this one.  I see it's in my Drafts folder.  I'll
> attach that patch below.  I'm rebasing against the 0827 mmotm, and I'll
> resend the rebased series.  However, I wanted to get your opinion of the
> nodemask patch below.
> 

It looks very reasonable to my eye. The caller must know that kfree() is
used to free it, since there is no matching free_nodemask_of_node(), but it's not worth
getting into a twist over.

> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] hugetlb:  add per node hstate attributes
  2009-08-27 19:35               ` David Rientjes
@ 2009-08-28 12:56                 ` Lee Schermerhorn
  0 siblings, 0 replies; 51+ messages in thread
From: Lee Schermerhorn @ 2009-08-28 12:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-mm, linux-numa, Andrew Morton,
	Nishanth Aravamudan, Adam Litke, Andy Whitcroft, eric.whitney

On Thu, 2009-08-27 at 12:35 -0700, David Rientjes wrote:
> On Wed, 26 Aug 2009, Lee Schermerhorn wrote:
> 
> > > I think it would probably be better to use the generic NODEMASK_ALLOC() 
> > > interface by requiring it to pass the entire type (including "struct") as 
> > > part of the first parameter.  Then it automatically takes care of 
> > > dynamically allocating large nodemasks vs. allocating them on the stack.
> > > 
> > > Would it work by redefining NODEMASK_ALLOC() in the NODES_SHIFT > 8 case 
> > > to be this:
> > > 
> > > 	#define NODEMASK_ALLOC(x, m) x *m = kmalloc(sizeof(*m), GFP_KERNEL);
> > > 
> > > and converting NODEMASK_SCRATCH(x) to NODEMASK_ALLOC(struct 
> > > nodemask_scratch, x), and then doing this in your code:
> > > 
> > > 	NODEMASK_ALLOC(nodemask_t, nodes_allowed);
> > > 	if (nodes_allowed)
> > > 		*nodes_allowed = nodemask_of_node(node);
> > > 
> > > The NODEMASK_{ALLOC,SCRATCH}() interface is in its infancy so it can 
> > > probably be made more general to handle cases like this.
> > 
> > I just don't know what that would accomplish.  Heck, I'm not all that
> > happy with the alloc_nodemask_of_node() because it's allocating both a
> > hidden nodemask_t and a pointer thereto on the stack just to return a
> > pointer to a kmalloc()ed nodemask_t--which is what I want/need here.
> > 
> > One issue I have with NODEMASK_ALLOC() [and nodemask_of_node(), et al]
> > is that it declares the pointer variable as well as initializing it,
> > perhaps with kmalloc(), ...   Indeed, its purpose is to replace
> > on-stack nodemask declarations.
> > 
> 
> Right, which is why I suggest we only have one such interface to 
> dynamically allocate nodemasks when NODES_SHIFT > 8.  That's what defines 
> NODEMASK_ALLOC() as being special: it's taking NODES_SHIFT into 
> consideration just like CPUMASK_ALLOC() would take NR_CPUS into 
> consideration.  Your use case is the intended purpose of NODEMASK_ALLOC() 
> and I see no reason why your code can't use the same interface with some 
> modification, and it's in the best interest of maintainability not to 
> duplicate specialized cases where pre-existing interfaces can be used (or 
> improved, in this case).
> 
> > So, to use it at the start of, e.g., set_max_huge_pages() where I can
> > safely use it throughout the function, I'll end up allocating the
> > nodes_allowed mask on every call, whether or not a node is specified or
> > there is a non-default mempolicy.  If it turns out that no node was
> > specified and we have default policy, we need to free the mask and NULL
> > out nodes_allowed up front so that we get default behavior.  That seems
> > uglier to me than only allocating the nodemask when we know we need one.
> > 
> 
> Not with my suggested code of disabling local irqs, getting a reference to 
> the mempolicy so it can't be freed, reenabling, and then only using 
> NODEMASK_ALLOC() in the switch statement on mpol->mode for MPOL_PREFERRED.
> 
> > I'm not opposed to using a generic function/macro where one exists that
> > suits my purposes.   I just don't see one.  I tried to create
> > one--alloc_nodemask_of_node(), and to keep Mel happy, I tried to reuse
> > nodemask_of_node() to initialize it.  I'm really not happy with the
> > results--because of those extra, hidden stack variables.  I could
> > eliminate those by creating an out-of-line function, but there's no good
> > place to put a generic nodemask function--no nodemask.c.  
> > 
> 
> Using NODEMASK_ALLOC(nodes_allowed) wouldn't really be a hidden stack 
> variable, would it?  I think most developers would assume that it is 
> some automatic variable called `nodes_allowed' since it's later referenced 
> (and only needs to be in the case of MPOL_PREFERRED if my mpol_get() 
> solution with disabled local irqs is used).

David:  

I'm going to repost my series with the version of
alloc_nodemask_of_node() that I sent out yesterday.  My entire
implementation is based on nodes_allowed, in set_max_huge_pages(), being
a pointer to a nodemask.  nodes_allowed must be NULL for default
behavior [NO_NODEID_SPECIFIED && default mempolicy].  It only gets
allocated when nid >= 0 or the task has a non-default mempolicy.  This seems
to work fairly well for both the mempolicy based constraint and the per
node attributes.  Please take a look at this series.  If you want to
propose a patch to rework the nodes_allowed allocation, have at it.  I'm
satisfied with the current implementation.

Now, we have a couple of options:  Mel said he's willing to proceed with
the mempolicy based constraint and leave the per node attributes to a
follow-up submission.  If you want to take over the per node attributes
feature and rework it, I can extract it from the series, including the
doc update, and turn it over to you.  Or, we can try to submit the
current implementation and follow up with patches to rework the generic
nodemask support as you propose.

Let me know how you want to proceed.

Lee


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2009-08-28 12:56 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-24 19:24 [PATCH 0/5] hugetlb: numa control of persistent huge pages alloc/free Lee Schermerhorn
2009-08-24 19:25 ` [PATCH 1/5] hugetlb: rework hstate_next_node_* functions Lee Schermerhorn
2009-08-25  8:10   ` David Rientjes
2009-08-25  8:10     ` David Rientjes
2009-08-24 19:26 ` [PATCH 2/5] hugetlb: add nodemask arg to huge page alloc, free and surplus adjust fcns Lee Schermerhorn
2009-08-25  8:16   ` David Rientjes
2009-08-25  8:16     ` David Rientjes
2009-08-25 20:49     ` Lee Schermerhorn
2009-08-25 20:49       ` Lee Schermerhorn
2009-08-25 21:59       ` David Rientjes
2009-08-25 21:59         ` David Rientjes
2009-08-26  9:58       ` Mel Gorman
2009-08-26  9:58         ` Mel Gorman
2009-08-24 19:27 ` [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy Lee Schermerhorn
2009-08-25  8:47   ` David Rientjes
2009-08-25  8:47     ` David Rientjes
2009-08-25 20:49     ` Lee Schermerhorn
2009-08-25 20:49       ` Lee Schermerhorn
2009-08-27 19:40       ` David Rientjes
2009-08-27 19:40         ` David Rientjes
2009-08-25 10:22   ` Mel Gorman
2009-08-25 10:22     ` Mel Gorman
2009-08-24 19:29 ` [PATCH 4/5] hugetlb: add per node hstate attributes Lee Schermerhorn
2009-08-25 10:19   ` Mel Gorman
2009-08-25 10:19     ` Mel Gorman
2009-08-25 20:49     ` Lee Schermerhorn
2009-08-25 20:49       ` Lee Schermerhorn
2009-08-26 10:11       ` Mel Gorman
2009-08-26 10:11         ` Mel Gorman
2009-08-26 18:02         ` Lee Schermerhorn
2009-08-26 18:02           ` Lee Schermerhorn
2009-08-26 19:47           ` David Rientjes
2009-08-26 19:47             ` David Rientjes
2009-08-26 20:46             ` Lee Schermerhorn
2009-08-26 20:46               ` Lee Schermerhorn
2009-08-27  9:52               ` Mel Gorman
2009-08-27  9:52                 ` Mel Gorman
2009-08-27 19:35               ` David Rientjes
2009-08-28 12:56                 ` Lee Schermerhorn
2009-08-26 18:04         ` Lee Schermerhorn
2009-08-27 10:23           ` Mel Gorman
2009-08-27 16:52             ` Lee Schermerhorn
2009-08-28 10:09               ` Mel Gorman
2009-08-28 10:09                 ` Mel Gorman
2009-08-25 13:35   ` Mel Gorman
2009-08-25 13:35     ` Mel Gorman
2009-08-25 20:49     ` Lee Schermerhorn
2009-08-25 20:49       ` Lee Schermerhorn
2009-08-26 10:12       ` Mel Gorman
2009-08-26 10:12         ` Mel Gorman
2009-08-24 19:30 ` [PATCH 5/5] hugetlb: update hugetlb documentation for mempolicy based management Lee Schermerhorn
