* [PATCH 0/4] hugetlb: V1 Per Node Hugepages attributes
From: Lee Schermerhorn @ 2009-07-29 18:11 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Greg KH, Nishanth Aravamudan, andi,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

PATCH/RFC 0/4  V1 Add Per Node Hugepages Attributes

Against:  2.6.31-rc3-mmotm-090716-1432
atop the previously posted alloc_bootmem_hugepages fix.
[http://marc.info/?l=linux-mm&m=124775468226290&w=4]

This is V1 of a third alternative for controlling allocation of
persistent huge pages on a NUMA system.  [Prior alternatives were a
separate "hugepages_nodes_allowed" mask and a mempolicy-based mask.]
This series implements a per node, per huge page size, read/write
attribute--nr_hugepages--to query and modify the persistent huge
pages on a specific node.  The series also implements read-only
attributes--free_hugepages and surplus_hugepages--to query the free
and surplus huge page counts per node.
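
Concretely, patch 4/4 below adds the following per node sysfs
interface for each supported huge page size:

  /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_hugepages     - r/o
	surplus_hugepages  - r/o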

This implementation continues to pass the libhugetlbfs functional test
suite.

Some issues/limitations with this series:

1) The series includes a rework/cleanup patch from the "mempolicy-
   based" huge pages series.  I think this rework is worth doing
   whichever method we choose for controlling per node huge pages.

2) The series extends struct kobject with a private bit field
   to aid the correlation of kobjects with the global or per node
   hstate attributes.  This is not absolutely required, but it did
   simplify the back mapping of kobjects to subsystem objects.

3) The reserved and overcommit counts remain global.  This seems to
   be the most straightforward usage, even in the context of per node
   persistent huge page attributes.  Global reserve and overcommit
   values allow mempolicy to be applied to the huge page allocation
   to satisfy a page fault.  [Some work appears to be needed in
   the per cpuset overcommit limit and reserve accounting, but
   that is outside the scope of this series.]

4) This series does not implement a boot command line parameter to
   control per node allocations.  This could be added if needed.

5) Using this method--per node attributes--to control persistent
   huge page allocation will require enhancements to hugeadm,
   including new command line syntax for specifying specific
   nodes, if we wish to avoid directly accessing the attributes.

6) I have yet to update the hugetlbfs doc for this alternative.

* [PATCH 1/4] hugetlb:  rework hstate_next_node_* functions
From: Lee Schermerhorn @ 2009-07-29 18:11 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Greg KH, Nishanth Aravamudan, andi,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

PATCH/RFC 1/4 hugetlb:  rework hstate_next_node* functions

Against: 2.6.31-rc3-mmotm-090716-1432
atop the previously posted alloc_bootmem_hugepages fix.
[http://marc.info/?l=linux-mm&m=124775468226290&w=4]

[From V3 of the mempolicy-based huge pages allocation series]

Modify the hstate_next_node* functions so that they can be called
to obtain the "start_nid".  Whereas prior to this patch we called
hstate_next_node_to_{alloc|free}() unconditionally, whether or not
we successfully allocated/freed a huge page on the node, we now
call these functions only on failure to alloc/free.

Factor out the next_node_allowed() function to handle wrapping at
the end of node_online_map.  In this version, the allowed nodes are
all of the online nodes.
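
After the rework, the allocation loop takes the following shape
[a sketch distilled from the alloc_fresh_huge_page() hunk below;
the free path is analogous]:

	start_nid = hstate_next_node_to_alloc(h); /* returns current nid, advances pointer */
	next_nid = start_nid;
	do {
		page = alloc_fresh_huge_page_node(h, next_nid);
		if (page)
			break;	/* success: no further advance */
		next_nid = hstate_next_node_to_alloc(h); /* advance only on failure */
	} while (next_nid != start_nid);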

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   70 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 25 deletions(-)

Index: linux-2.6.31-rc3-mmotm-090716-1432/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/mm/hugetlb.c	2009-07-22 15:42:46.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/mm/hugetlb.c	2009-07-22 15:42:48.000000000 -0400
@@ -622,6 +622,20 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
+ * common helper function for hstate_next_node_to_{alloc|free}.
+ * return next node in node_online_map, wrapping at end.
+ */
+static int next_node_allowed(int nid)
+{
+	nid = next_node(nid, node_online_map);
+	if (nid == MAX_NUMNODES)
+		nid = first_node(node_online_map);
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+
+	return nid;
+}
+
+/*
  * Use a helper variable to find the next node and then
  * copy it back to next_nid_to_alloc afterwards:
  * otherwise there's a window in which a racer might
@@ -634,12 +648,12 @@ static struct page *alloc_fresh_huge_pag
  */
 static int hstate_next_node_to_alloc(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_alloc;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_alloc = next_nid;
-	return next_nid;
+	return nid;
 }
 
 static int alloc_fresh_huge_page(struct hstate *h)
@@ -649,15 +663,17 @@ static int alloc_fresh_huge_page(struct 
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_alloc;
+	start_nid = hstate_next_node_to_alloc(h);
 	next_nid = start_nid;
 
 	do {
 		page = alloc_fresh_huge_page_node(h, next_nid);
-		if (page)
+		if (page) {
 			ret = 1;
+			break;
+		}
 		next_nid = hstate_next_node_to_alloc(h);
-	} while (!page && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -668,17 +684,19 @@ static int alloc_fresh_huge_page(struct 
 }
 
 /*
- * helper for free_pool_huge_page() - find next node
- * from which to free a huge page
+ * helper for free_pool_huge_page() - return the next node
+ * from which to free a huge page.  Advance the next node id
+ * whether or not we find a free huge page to free so that the
+ * next attempt to free addresses the next node.
  */
 static int hstate_next_node_to_free(struct hstate *h)
 {
-	int next_nid;
-	next_nid = next_node(h->next_nid_to_free, node_online_map);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+	int nid, next_nid;
+
+	nid = h->next_nid_to_free;
+	next_nid = next_node_allowed(nid);
 	h->next_nid_to_free = next_nid;
-	return next_nid;
+	return nid;
 }
 
 /*
@@ -693,7 +711,7 @@ static int free_pool_huge_page(struct hs
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_free;
+	start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
@@ -715,9 +733,10 @@ static int free_pool_huge_page(struct hs
 			}
 			update_and_free_page(h, page);
 			ret = 1;
+			break;
 		}
 		next_nid = hstate_next_node_to_free(h);
-	} while (!ret && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	return ret;
 }
@@ -1028,10 +1047,9 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(h->next_nid_to_alloc),
+				NODE_DATA(hstate_next_node_to_alloc(h)),
 				huge_page_size(h), huge_page_size(h), 0);
 
-		hstate_next_node_to_alloc(h);
 		if (addr) {
 			/*
 			 * Use the beginning of the huge page to store the
@@ -1167,29 +1185,31 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = h->next_nid_to_alloc;
+		start_nid = hstate_next_node_to_alloc(h);
 	else
-		start_nid = h->next_nid_to_free;
+		start_nid = hstate_next_node_to_free(h);
 	next_nid = start_nid;
 
 	do {
 		int nid = next_nid;
 		if (delta < 0)  {
-			next_nid = hstate_next_node_to_alloc(h);
 			/*
 			 * To shrink on this node, there must be a surplus page
 			 */
-			if (!h->surplus_huge_pages_node[nid])
+			if (!h->surplus_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_alloc(h);
 				continue;
+			}
 		}
 		if (delta > 0) {
-			next_nid = hstate_next_node_to_free(h);
 			/*
 			 * Surplus cannot exceed the total number of pages
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
+						h->nr_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_free(h);
 				continue;
+			}
 		}
 
 		h->surplus_huge_pages += delta;

* [PATCH 2/4] hugetlb:  numafy several functions
From: Lee Schermerhorn @ 2009-07-29 18:11 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Greg KH, Nishanth Aravamudan, andi,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

PATCH/RFC 2/4 hugetlb:  numafy several functions

Against: 2.6.31-rc3-mmotm-090716-1432
atop the previously posted alloc_bootmem_hugepages fix.
[http://marc.info/?l=linux-mm&m=124775468226290&w=4]

Based on a patch by Nishanth Aravamudan <nacc@us.ibm.com>, circa
April 2008.

Factor out functions to dequeue and free huge pages and/or adjust
the surplus huge page count for a specific node, in support of the
subsequent patch to allocate or free huge pages on a specified node.
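
The node-level helpers factored out here share a simple convention:
return 1 on success, 0 on failure, so that the round-robin wrappers
stay thin.  A sketch of their use, with signatures as in the diff
below:

	/* allocate one fresh huge page on nid; 1 on success */
	ret = alloc_fresh_huge_page_node(h, nid);

	/* free one page from nid's freelist, optionally accounting
	 * it as surplus; 1 on success */
	ret = hstate_free_huge_page_node(h, acct_surplus, nid);

	/* apply delta [+1/-1] to nid's surplus count, limits permitting */
	ret = adjust_pool_surplus_node(h, delta, nid);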

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

---
 mm/hugetlb.c |  126 +++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 72 insertions(+), 54 deletions(-)

Index: linux-2.6.31-rc3-mmotm-090716-1432/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/mm/hugetlb.c	2009-07-23 11:10:29.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/mm/hugetlb.c	2009-07-27 15:26:39.000000000 -0400
@@ -456,6 +456,17 @@ static void enqueue_huge_page(struct hst
 	h->free_huge_pages_node[nid]++;
 }
 
+static struct page *hstate_dequeue_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+	list_del(&page->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
@@ -487,11 +498,7 @@ static struct page *dequeue_huge_page_vm
 		nid = zone_to_nid(zone);
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
 		    !list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
+			page = hstate_dequeue_huge_page_node(h, nid);
 
 			if (!avoid_reserve)
 				decrement_hugepage_resv_vma(h, vma);
@@ -599,12 +606,12 @@ int PageHuge(struct page *page)
 	return dtor == free_huge_page;
 }
 
-static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+static int alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
 	if (h->order >= MAX_ORDER)
-		return NULL;
+		return 0;
 
 	page = alloc_pages_exact_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
@@ -613,12 +620,12 @@ static struct page *alloc_fresh_huge_pag
 	if (page) {
 		if (arch_prepare_hugepage(page)) {
 			__free_pages(page, huge_page_order(h));
-			return NULL;
+			return 0;
 		}
 		prep_new_huge_page(h, page, nid);
 	}
 
-	return page;
+	return 1;
 }
 
 /*
@@ -658,7 +665,6 @@ static int hstate_next_node_to_alloc(str
 
 static int alloc_fresh_huge_page(struct hstate *h)
 {
-	struct page *page;
 	int start_nid;
 	int next_nid;
 	int ret = 0;
@@ -667,11 +673,9 @@ static int alloc_fresh_huge_page(struct 
 	next_nid = start_nid;
 
 	do {
-		page = alloc_fresh_huge_page_node(h, next_nid);
-		if (page) {
-			ret = 1;
+		ret = alloc_fresh_huge_page_node(h, next_nid);
+		if (ret)
 			break;
-		}
 		next_nid = hstate_next_node_to_alloc(h);
 	} while (next_nid != start_nid);
 
@@ -699,6 +703,23 @@ static int hstate_next_node_to_free(stru
 	return nid;
 }
 
+static int hstate_free_huge_page_node(struct hstate *h, bool acct_surplus,
+							int nid)
+{
+	struct page *page;
+
+	if (list_empty(&h->hugepage_freelists[nid]))
+		return 0;
+
+	page = hstate_dequeue_huge_page_node(h, nid);
+	if (acct_surplus) {
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
+	}
+	update_and_free_page(h, page);
+	return 1;
+}
+
 /*
  * Free huge page from pool from next node to free.
  * Attempt to keep persistent huge pages more or less
@@ -719,21 +740,11 @@ static int free_pool_huge_page(struct hs
 		 * If we're returning unused surplus pages, only examine
 		 * nodes with surplus pages.
 		 */
-		if ((!acct_surplus || h->surplus_huge_pages_node[next_nid]) &&
-		    !list_empty(&h->hugepage_freelists[next_nid])) {
-			struct page *page =
-				list_entry(h->hugepage_freelists[next_nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[next_nid]--;
-			if (acct_surplus) {
-				h->surplus_huge_pages--;
-				h->surplus_huge_pages_node[next_nid]--;
-			}
-			update_and_free_page(h, page);
-			ret = 1;
-			break;
+		if ((!acct_surplus || h->surplus_huge_pages_node[next_nid])) {
+			ret = hstate_free_huge_page_node(h, acct_surplus,
+			                                    next_nid);
+			if (ret)
+				break;
 		}
 		next_nid = hstate_next_node_to_free(h);
 	} while (next_nid != start_nid);
@@ -1173,6 +1184,31 @@ static inline void try_to_free_low(struc
 #endif
 
 /*
+ * Increment or decrement surplus_huge_pages for a specified node,
+ * if conditions permit.  Note that decrementing the surplus huge
+ * page count effectively promotes a page to persistent, while
+ * incrementing the surplus count demotes a page to surplus.
+ */
+static int adjust_pool_surplus_node(struct hstate *h, int delta, int nid)
+{
+	int ret = 0;
+
+	/*
+	 * To shrink on this node, there must be a surplus page.
+	 * Surplus cannot exceed the total number of pages.
+	 */
+	if ((delta < 0 && h->surplus_huge_pages_node[nid]) ||
+	    (delta > 0 && h->surplus_huge_pages_node[nid] <
+					h->nr_huge_pages_node[nid])) {
+
+		h->surplus_huge_pages += delta;
+		h->surplus_huge_pages_node[nid] += delta;
+		ret = 1;
+	}
+	return ret;
+}
+
+/*
  * Increment or decrement surplus_huge_pages.  Keep node-specific counters
  * balanced by operating on them in a round-robin fashion.
  * Returns 1 if an adjustment was made.
@@ -1191,31 +1227,13 @@ static int adjust_pool_surplus(struct hs
 	next_nid = start_nid;
 
 	do {
-		int nid = next_nid;
-		if (delta < 0)  {
-			/*
-			 * To shrink on this node, there must be a surplus page
-			 */
-			if (!h->surplus_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_alloc(h);
-				continue;
-			}
-		}
-		if (delta > 0) {
-			/*
-			 * Surplus cannot exceed the total number of pages
-			 */
-			if (h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid]) {
-				next_nid = hstate_next_node_to_free(h);
-				continue;
-			}
-		}
-
-		h->surplus_huge_pages += delta;
-		h->surplus_huge_pages_node[nid] += delta;
-		ret = 1;
-		break;
+		ret = adjust_pool_surplus_node(h, delta, next_nid);
+		if (ret)
+			break;
+		if (delta < 0)
+			next_nid = hstate_next_node_to_alloc(h);
+		else
+			next_nid = hstate_next_node_to_free(h);
 	} while (next_nid != start_nid);
 
 	return ret;

* [PATCH 3/4] hugetlb:  add private bit-field to kobject structure
From: Lee Schermerhorn @ 2009-07-29 18:11 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Greg KH, Nishanth Aravamudan, andi,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

PATCH/RFC 3/4 hugetlb:  add private bitfield to struct kobject

Against: 2.6.31-rc3-mmotm-090716-1432
atop the previously posted alloc_bootmem_hugepages fix.
[http://marc.info/?l=linux-mm&m=124775468226290&w=4]

For the per node huge page attributes, we want to share
as much code as possible with the global huge page attributes,
including the show/store functions.  To do this, we need a way
to translate the kobj argument passed to a show/store function
back to the node id when entered via the per node path.  This
patch adds a subsystem/sysdev private bitfield to the kobject
structure.  The bitfield uses unused bits in the same unsigned
int as the various kobject flags, so it does not increase the
size of the structure.

Currently, the bit field is the minimum width required for the
per node huge page attributes [plus one extra bit].  The field
could be expanded for other usage, should such arise.

Note that this is not absolutely required.  However, using this
private field eliminates an inner loop to scan the per node
hstate kobjects, and eliminates scanning entirely for the global
hstate kobjects.
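
For concreteness, a sketch of the encoding that patch 4/4 stores in
this field--bit 0 flags a per node kobject and the remaining bits
hold the hstate index ["is_per_node" here stands in for the
(parent != hugepages_kobj) test used in that patch]:

	/* at registration time, in hugetlb_sysfs_add_hstate(): */
	kobj->private = (hstate_index << 1) | is_per_node;

	/* in the shared show/store paths: */
	static int kobj_to_hstate_index(struct kobject *kobj)
	{
		return kobj->private >> 1;
	}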

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

---
 include/linux/kobject.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.31-rc3-mmotm-090716-1432/include/linux/kobject.h
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/include/linux/kobject.h	2009-07-24 10:01:27.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/include/linux/kobject.h	2009-07-24 10:04:34.000000000 -0400
@@ -56,6 +56,8 @@ enum kobject_action {
 	KOBJ_MAX
 };
 
+#define KOBJ_PRIVATE_BITS 3	/* subsystem/sysdev private */
+
 struct kobject {
 	const char		*name;
 	struct list_head	entry;
@@ -69,6 +71,7 @@ struct kobject {
 	unsigned int state_add_uevent_sent:1;
 	unsigned int state_remove_uevent_sent:1;
 	unsigned int uevent_suppress:1;
+	unsigned int private:KOBJ_PRIVATE_BITS;
 };
 
 extern int kobject_set_name(struct kobject *kobj, const char *name, ...)

* [PATCH 4/4] hugetlb:  add per node hstate attributes
From: Lee Schermerhorn @ 2009-07-29 18:12 UTC
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Greg KH, Nishanth Aravamudan, andi,
	David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

PATCH/RFC 4/4 hugetlb:  register per node hugepages attributes

Against: 2.6.31-rc3-mmotm-090716-1432
atop the previously posted alloc_bootmem_hugepages fix.
[http://marc.info/?l=linux-mm&m=124775468226290&w=4]

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_hugepages     - r/o
	surplus_hugepages  - r/o

The patch attempts to re-use/share as much of the existing
global hstate attribute initialization and handling as possible.
Throughout, a node id < 0 indicates global hstate parameters.

Note:  computation of "min_count" in set_max_huge_pages() for a
specified node needs careful review. 
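
For reviewers, the per node "min_count" computation in the
set_max_huge_pages() hunk below amounts to:

	/*
	 * The node must keep its in-use pages plus whatever part of
	 * the global reserve the other nodes' free pages cannot cover.
	 */
	need_reserve = (long)h->resv_huge_pages -
		(h->free_huge_pages - h->free_huge_pages_node[nid]);
	if (need_reserve < 0)
		need_reserve = 0;
	min_count = h->nr_huge_pages_node[nid] -
		h->free_huge_pages_node[nid] + need_reserve;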

Issue:  dependency of the base [node] driver on the hugetlbfs module.
We want to keep all of the hstate attribute registration and handling
in the hugetlb module.  However, we need to call into this code to
register the per node hstate attributes on node hot plug.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

---
 drivers/base/node.c     |    2 
 include/linux/hugetlb.h |    6 +
 include/linux/node.h    |    2 
 mm/hugetlb.c            |  266 +++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 239 insertions(+), 37 deletions(-)

Index: linux-2.6.31-rc3-mmotm-090716-1432/drivers/base/node.c
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/drivers/base/node.c	2009-07-27 16:23:27.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/drivers/base/node.c	2009-07-27 16:23:28.000000000 -0400
@@ -200,6 +200,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +221,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-rc3-mmotm-090716-1432/include/linux/hugetlb.h
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/include/linux/hugetlb.h	2009-07-27 16:23:27.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/include/linux/hugetlb.h	2009-07-27 16:23:28.000000000 -0400
@@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
 	return size_to_hstate(PAGE_SIZE << compound_order(page));
 }
 
+struct node;
+extern void hugetlb_register_node(struct node *);
+extern void hugetlb_unregister_node(struct node *);
+
 #else
 struct hstate {};
 #define alloc_bootmem_huge_page(h) NULL
@@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
 {
 	return 1;
 }
+#define hugetlb_register_node(NP)
+#define hugetlb_unregister_node(NP)
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
Index: linux-2.6.31-rc3-mmotm-090716-1432/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/mm/hugetlb.c	2009-07-27 16:23:27.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/mm/hugetlb.c	2009-07-27 16:23:28.000000000 -0400
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/bootmem.h>
 #include <linux/sysfs.h>
+#include <linux/node.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -26,6 +27,10 @@
 #include <linux/hugetlb.h>
 #include "internal.h"
 
+#if (HUGE_MAX_HSTATE > (1 << (KOBJ_PRIVATE_BITS - 1)))
+#error KOBJ_PRIVATE_BITS too small for HUGE_MAX_HSTATE hstates
+#endif
+
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
@@ -1155,14 +1160,22 @@ static void __init report_hugepages(void
 }
 
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(struct hstate *h, unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count, int nid)
 {
-	int i;
+	int i, start_i, max_i;
 
 	if (h->order >= MAX_ORDER)
 		return;
 
-	for (i = 0; i < MAX_NUMNODES; ++i) {
+	if (nid < 0) {
+		start_i = 0;
+		max_i = MAX_NUMNODES;
+	} else {
+		start_i = nid;
+		max_i = nid + 1;
+	}
+
+	for (i = start_i; i < max_i; ++i) {
 		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
@@ -1178,7 +1191,8 @@ static void try_to_free_low(struct hstat
 	}
 }
 #else
-static inline void try_to_free_low(struct hstate *h, unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count,
+								int nid)
 {
 }
 #endif
@@ -1239,8 +1253,17 @@ static int adjust_pool_surplus(struct hs
 	return ret;
 }
 
-#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long persistent_huge_pages(struct hstate *h, int nid)
+{
+	if (nid < 0)
+		return h->nr_huge_pages - h->surplus_huge_pages;
+	else
+		return h->nr_huge_pages_node[nid] -
+			h->surplus_huge_pages_node[nid];
+}
+
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+					int nid)
 {
 	unsigned long min_count, ret;
 
@@ -1259,19 +1282,26 @@ static unsigned long set_max_huge_pages(
 	 * within all the constraints specified by the sysctls.
 	 */
 	spin_lock(&hugetlb_lock);
-	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, -1))
+	while (h->surplus_huge_pages && count > persistent_huge_pages(h, nid)) {
+		if (nid < 0)
+			ret = adjust_pool_surplus(h, -1);
+		else
+			ret = adjust_pool_surplus_node(h, -1, nid);
+		if (!ret)
 			break;
 	}
 
-	while (count > persistent_huge_pages(h)) {
+	while (count > persistent_huge_pages(h, nid)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h);
+		if (nid < 0)
+			ret = alloc_fresh_huge_page(h);
+		else
+			ret = alloc_fresh_huge_page_node(h, nid);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1293,19 +1323,51 @@ static unsigned long set_max_huge_pages(
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
 	 */
-	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
+	if (nid < 0) {
+		/*
+		 * global min_count = reserve + in-use
+		 */
+		min_count = h->resv_huge_pages +
+				 h->nr_huge_pages - h->free_huge_pages;
+	} else {
+		/*
+		 * per node min_count = "min share of global reserve" +
+		 *     in-use
+		 */
+		long need_reserve = (long)h->resv_huge_pages -
+		         (h->free_huge_pages - h->free_huge_pages_node[nid]);
+		if (need_reserve < 0)
+			need_reserve = 0;
+		min_count =
+		    h->nr_huge_pages_node[nid] - h->free_huge_pages_node[nid] +
+		    need_reserve;
+	}
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count);
-	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, 0))
+	try_to_free_low(h, min_count, nid);
+	while (min_count < persistent_huge_pages(h, nid)) {
+		if (nid < 0)
+			ret = free_pool_huge_page(h, 0);
+		else
+			ret = hstate_free_huge_page_node(h, 0, nid);
+
+		if (!ret)
 			break;
 	}
-	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, 1))
+
+	while (count < persistent_huge_pages(h, nid)) {
+		if (nid < 0)
+			ret = adjust_pool_surplus(h, 1);
+		else
+			ret = adjust_pool_surplus_node(h, 1, nid);
+		if (!ret)
 			break;
 	}
 out:
-	ret = persistent_huge_pages(h);
+
+	/*
+	 * return global persistent huge pages
+	 */
+	ret = persistent_huge_pages(h, -1);
 	spin_unlock(&hugetlb_lock);
 	return ret;
 }
@@ -1320,34 +1382,64 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
+static int kobj_to_hstate_index(struct kobject *kobj)
+{
+	return kobj->private >> 1;
+}
+
+static int kobj_to_node_id(struct kobject *kobj)
+{
+	int nid = -1;
+
+	if (kobj->private & 1) {
+		int hi = kobj_to_hstate_index(kobj);
+
+		for (nid = 0; nid < nr_node_ids; nid++) {
+			struct node *node = &node_devices[nid];
+			if (node->hstate_kobjs[hi] == kobj)
+				break;
+		}
+		if (nid == nr_node_ids) {
+			BUG();
+			nid = -1;
+		}
+	}
+	return nid;
+}
+
 static struct hstate *kobj_to_hstate(struct kobject *kobj)
 {
-	int i;
-	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
-			return &hstates[i];
-	BUG();
-	return NULL;
+	return &hstates[kobj_to_hstate_index(kobj)];
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
 	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	unsigned long nr_huge_pages;
+	int nid = kobj_to_node_id(kobj);
+
+	if (nid < 0)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
-	int err;
 	unsigned long input;
 	struct hstate *h = kobj_to_hstate(kobj);
+	int nid = kobj_to_node_id(kobj);
+	int err;
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h->max_huge_pages = set_max_huge_pages(h, input, nid);
 
 	return count;
 }
@@ -1359,6 +1451,7 @@ static ssize_t nr_overcommit_hugepages_s
 	struct hstate *h = kobj_to_hstate(kobj);
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
@@ -1382,7 +1475,15 @@ static ssize_t free_hugepages_show(struc
 					struct kobj_attribute *attr, char *buf)
 {
 	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	unsigned long free_huge_pages;
+	int nid = kobj_to_node_id(kobj);
+
+	if (nid < 0)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
@@ -1398,7 +1499,15 @@ static ssize_t surplus_hugepages_show(st
 					struct kobj_attribute *attr, char *buf)
 {
 	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	unsigned long surplus_huge_pages;
+	int nid = kobj_to_node_id(kobj);
+
+	if (nid < 0)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1415,19 +1524,27 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	/*
+	 * Use kobject private bitfield to save hstate index and to
+	 * indicate per node hstate_kobj for show/store functions
+	 */
+	hstate_kobjs[hi]->private = (hi << 1) | (parent != hugepages_kobj);
+
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1442,17 +1559,90 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		kobject_put(node->hstate_kobjs[h - hstates]);
+		node->hstate_kobjs[h - hstates] = NULL;
+	}
+
+	kobject_put(node->hugepages_kobj);
+	node->hugepages_kobj = NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+}
+
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	int err;
+
+	if (!hugepages_kobj)
+		return;		/* too early */
+
+	node->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!node->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
+						node->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err)
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+	}
+}
+
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid && !node->hugepages_kobj)
+			hugetlb_register_node(node);
+	}
+}
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1487,6 +1677,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1589,7 +1781,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
 
 	return 0;
 }
Index: linux-2.6.31-rc3-mmotm-090716-1432/include/linux/node.h
===================================================================
--- linux-2.6.31-rc3-mmotm-090716-1432.orig/include/linux/node.h	2009-07-27 16:23:27.000000000 -0400
+++ linux-2.6.31-rc3-mmotm-090716-1432/include/linux/node.h	2009-07-27 16:23:28.000000000 -0400
@@ -24,6 +24,8 @@
 
 struct node {
 	struct sys_device	sysdev;
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
 };
 
 struct memory_block;

* Re: [PATCH 3/4] hugetlb:  add private bit-field to kobject structure
From: Greg KH @ 2009-07-29 18:25 UTC
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	andi, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

On Wed, Jul 29, 2009 at 02:11:58PM -0400, Lee Schermerhorn wrote:
> PATCH/RFC 3/4 hugetlb:  add private bitfield to struct kobject
> 
> Against: 2.6.31-rc3-mmotm-090716-1432
> atop the previously posted alloc_bootmem_hugepages fix.
> [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
> 
> For the per node huge page attributes, we want to share
> as much code as possible with the global huge page attributes,
> including the show/store functions.  To do this, we need a way
> to translate the kobj argument passed to a show/store function
> back to the node id when entered via the per node path.  This
> patch adds a subsystem/sysdev private bitfield to the kobject
> structure.  The bitfield uses unused bits in the same unsigned
> int as the various kobject flags so as not to increase the size
> of the structure. 
> 
> Currently, the bit field is the minimum required for the huge
> pages per node attributes [plus one extra bit].  The field could
> be expanded for other usage, should such arise.
> 
> Note that this is not absolutely required.  However, using this
> private field eliminates an inner loop to scan the per node
> hstate kobjects and eliminates scanning entirely for the global
> hstate kobjects.

Ick, no, please don't do that.  That's what the structure you use to
contain your kobject should be for, right?

Or are you for some reason using "raw" kobjects?

thanks,

greg k-h

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
From: David Rientjes @ 2009-07-30 19:39 UTC
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Greg KH,
	Nishanth Aravamudan, andi, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, Jul 29, 2009 at 11:12 AM, Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> PATCH/RFC 4/4 hugetlb:  register per node hugepages attributes
>
> Against: 2.6.31-rc3-mmotm-090716-1432
> atop the previously posted alloc_bootmem_hugepages fix.
> [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
>
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
>
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
>        nr_hugepages       - r/w
>        free_hugepages     - r/o
>        surplus_hugepages  - r/o
>
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling as possible.
> Throughout, a node id < 0 indicates global hstate parameters.
>
> Note:  computation of "min_count" in set_max_huge_pages() for a
> specified node needs careful review.
>
> Issue:  dependency of the base [node] driver on the hugetlbfs module.
> We want to keep all of the hstate attribute registration and handling
> in the hugetlb module.  However, we need to call into this code to
> register the per node hstate attributes on node hot plug.
>
> With this patch:
>
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
>
> Starting from:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     0
> Node 2 HugePages_Free:      0
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 0
>
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>
> Yields:
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:    16
> Node 2 HugePages_Free:     16
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
> vm.nr_hugepages = 16
>
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>
> Node 0 HugePages_Total:     0
> Node 0 HugePages_Free:      0
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> Node 2 HugePages_Total:     8
> Node 2 HugePages_Free:      8
> Node 2 HugePages_Surp:      0
> Node 3 HugePages_Total:     0
> Node 3 HugePages_Free:      0
> Node 3 HugePages_Surp:      0
>
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>

Thank you very much for doing this.

Google is going to need this support regardless of what finally gets
merged into mainline, so I'm thrilled you've implemented this version.

I hugely (get it? hugely :) favor this approach because it's much
simpler to reserve hugepages from this interface than with a
mempolicy based approach once hugepages have already been allocated.
For cpusets users in particular, jobs typically get allocated on a
subset of nodes that are required for that application, and they
don't last for the duration of the machine's uptime.  When a job
exits and the nodes need to be reallocated to a new cpuset, it may
be a very different set of mems based on the memory requirements or
interleave optimizations for the new job.  Allocating resources such
as hugepages is possible in this scenario via mempolicies, but it
would require a temporary mempolicy to allocate additional hugepages
from, which seems like an unnecessary requirement, especially if the
job scheduler that is governing hugepage allocations already has a
mempolicy of its own.

So it's my opinion that the mempolicy based approach is very
appropriate for tasks that allocate hugepages themselves.  Other
users, particularly cpusets users, however, would require
preallocation of hugepages prior to a job being scheduled, in which
case a temporary mempolicy would be required for that job scheduler.
That seems like an inconvenience when the entire state of the
system's hugepages could easily be governed with the per-node hstate
attributes and a slightly modified user library.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-07-30 19:39     ` David Rientjes
@ 2009-07-31 10:36       ` Mel Gorman
  -1 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-07-31 10:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, linux-mm, linux-numa, akpm, Greg KH,
	Nishanth Aravamudan, andi, Adam Litke, Andy Whitcroft,
	eric.whitney

On Thu, Jul 30, 2009 at 12:39:09PM -0700, David Rientjes wrote:
> On Wed, Jul 29, 2009 at 11:12 AM, Lee
> Schermerhorn<lee.schermerhorn@hp.com> wrote:
> > PATCH/RFC 4/4 hugetlb:  register per node hugepages attributes
> >
> > Against: 2.6.31-rc3-mmotm-090716-1432
> > atop the previously posted alloc_bootmem_hugepages fix.
> > [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
> >
> > This patch adds the per huge page size control/query attributes
> > to the per node sysdevs:
> >
> > /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> >        nr_hugepages       - r/w
> >        free_huge_pages    - r/o
> >        surplus_huge_pages - r/o
> >
> > The patch attempts to re-use/share as much of the existing
> > global hstate attribute initialization and handling as possible.
> > Throughout, a node id < 0 indicates global hstate parameters.
> >
> > Note:  computation of "min_count" in set_max_huge_pages() for a
> > specified node needs careful review.
> >
> > Issue:  dependency of the base [node] driver on the hugetlbfs module.
> > We want to keep all of the hstate attribute registration and handling
> > in the hugetlb module.  However, we need to call into this code to
> > register the per node hstate attributes on node hot plug.
> >
> > With this patch:
> >
> > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> > ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
> >
> > Starting from:
> > Node 0 HugePages_Total:     0
> > Node 0 HugePages_Free:      0
> > Node 0 HugePages_Surp:      0
> > Node 1 HugePages_Total:     0
> > Node 1 HugePages_Free:      0
> > Node 1 HugePages_Surp:      0
> > Node 2 HugePages_Total:     0
> > Node 2 HugePages_Free:      0
> > Node 2 HugePages_Surp:      0
> > Node 3 HugePages_Total:     0
> > Node 3 HugePages_Free:      0
> > Node 3 HugePages_Surp:      0
> > vm.nr_hugepages = 0
> >
> > Allocate 16 persistent huge pages on node 2:
> > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
> >
> > Yields:
> > Node 0 HugePages_Total:     0
> > Node 0 HugePages_Free:      0
> > Node 0 HugePages_Surp:      0
> > Node 1 HugePages_Total:     0
> > Node 1 HugePages_Free:      0
> > Node 1 HugePages_Surp:      0
> > Node 2 HugePages_Total:    16
> > Node 2 HugePages_Free:     16
> > Node 2 HugePages_Surp:      0
> > Node 3 HugePages_Total:     0
> > Node 3 HugePages_Free:      0
> > Node 3 HugePages_Surp:      0
> > vm.nr_hugepages = 16
> >
> > Global controls work as expected--reduce pool to 8 persistent huge pages:
> > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >
> > Node 0 HugePages_Total:     0
> > Node 0 HugePages_Free:      0
> > Node 0 HugePages_Surp:      0
> > Node 1 HugePages_Total:     0
> > Node 1 HugePages_Free:      0
> > Node 1 HugePages_Surp:      0
> > Node 2 HugePages_Total:     8
> > Node 2 HugePages_Free:      8
> > Node 2 HugePages_Surp:      0
> > Node 3 HugePages_Total:     0
> > Node 3 HugePages_Free:      0
> > Node 3 HugePages_Surp:      0
> >
> >
> >
> >
> >
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> >
> 
> Thank you very much for doing this.
> 
> Google is going to need this support regardless of what finally gets
> merged into mainline, so I'm thrilled you've implemented this version.
> 

The fact that there is a definite use case in mind lends weight to
this approach, but I want to be 100% sure that a hugetlbfs-specific
interface is required in this case.

> I hugely (get it? hugely :) favor this approach because it's much
> simpler to reserve hugepages from this interface than with a
> mempolicy-based approach once hugepages have already been allocated.
> For cpusets users in particular, jobs typically get allocated on a
> subset of nodes that are required for that application, and they don't
> last for the duration of the machine's uptime.  When a job exits and
> the nodes need to be reallocated to a new cpuset, it may be a very
> different set of mems based on the memory requirements or interleave
> optimizations for the new job.  Allocating resources such as hugepages
> is possible in this scenario via mempolicies, but it would require a
> temporary mempolicy just to allocate the additional hugepages, which
> seems like an unnecessary requirement, especially if the job scheduler
> that is governing hugepage allocations already has a mempolicy of its
> own.
> 

I don't know the setup, but let's say something like the following is
happening:

1. Job scheduler creates a cpuset for a subset of nodes
2. Job scheduler creates a memory policy for that subset of nodes
3. Initialisation job starts and reserves huge pages; if a memory
   policy is already in place, it will reserve them in the correct
   places
4. Job completes
5. Job scheduler frees the pages reserved for the job, freeing up
   pages on the subset of nodes

i.e. if the job scheduler, or even some child process of that job
scheduler, already has a memory policy of its own, it should just be
able to set nr_hugepages and have the pages reserved on the correct
nodes.

With the per-node-attribute approach, little stops a process going
outside of its subset of allowed nodes.
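
For example--assuming the sysfs paths proposed in this patch--a task
confined to a cpuset of nodes 0-1 could still do

  echo 16 > /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages

and allocate pages well outside its own mems, where a mempolicy-based
reservation is naturally confined.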

> So it's my opinion that the mempolicy based approach is very
> appropriate for tasks that allocate hugepages themselves.  Other
> users, particularly cpusets users, however, would require
> preallocation of hugepages prior to a job being scheduled, in which
> case a temporary mempolicy would be required for that job scheduler.

And why is it so difficult for the job scheduler just to create a
child that does

numactl -m $NODES_SUBSET hugeadm --pool-pages-min 2M:+$PAGES_NEEDED

?
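
Spelling out the whole cycle from the list above--only a sketch, with
placeholder variables, and assuming hugeadm's relative +/- pool
adjustments:

  # grow the pool; the policy confines the new pages to the subset
  numactl -m $NODES_SUBSET hugeadm --pool-pages-min 2M:+$PAGES_NEEDED
  # ... job runs and uses the pages ...
  # shrink the pool again once the job completes
  numactl -m $NODES_SUBSET hugeadm --pool-pages-min 2M:-$PAGES_NEEDED

No new kernel interface is involved.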

> That seems like
> an inconvenience when the entire state of the system's hugepages could
> easily be governed with the per-node hstate attributes and a slightly
> modified user library.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] hugetlb:  add private bit-field to kobject structure
  2009-07-29 18:25   ` Greg KH
@ 2009-07-31 18:59       ` Lee Schermerhorn
  0 siblings, 0 replies; 31+ messages in thread
From: Lee Schermerhorn @ 2009-07-31 18:59 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, linux-numa, akpm, Mel Gorman, Nishanth Aravamudan,
	andi, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney

On Wed, 2009-07-29 at 11:25 -0700, Greg KH wrote:
> On Wed, Jul 29, 2009 at 02:11:58PM -0400, Lee Schermerhorn wrote:
> > PATCH/RFC 3/4 hugetlb:  add private bitfield to struct kobject
> > 
> > Against: 2.6.31-rc3-mmotm-090716-1432
> > atop the previously posted alloc_bootmem_hugepages fix.
> > [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
> > 
> > For the per node huge page attributes, we want to share
> > as much code as possible with the global huge page attributes,
> > including the show/store functions.  To do this, we'll need a
> > way to back translate from the kobj argument to the show/store
> > function to the node id, when entered via that path.  This
> > patch adds a subsystem/sysdev private bitfield to the kobject
> > structure.  The bitfield uses unused bits in the same unsigned
> > int as the various kobject flags so as not to increase the size
> > of the structure. 
> > 
> > Currently, the bit field is the minimum required for the huge
> > pages per node attributes [plus one extra bit].  The field could
> > be expanded for other usage, should such arise.
> > 
> > Note that this is not absolutely required.  However, using this
> > private field eliminates an inner loop to scan the per node
> > hstate kobjects and eliminates scanning entirely for the global
> > hstate kobjects.
> 
> Ick, no, please don't do that.  That's what the structure you use to
> contain your kobject should be for, right?
> 
> Or are you for some reason using "raw" kobjects?

OK, reworked to remove the private bit field from kobject.  This
replaces patches 3 and 4 of 4 in case anyone wants to test the series
w/o the change to the kobject structure.  I'll be sending out another
version of this patch in response to a thread with David and Mel
shortly.

Lee
---

PATCH/RFC 3/3 hugetlb:  register per node hugepages attributes

Against: 2.6.31-rc3-mmotm-090716-1432
atop the previously posted alloc_bootmem_hugepages fix.
[http://marc.info/?l=linux-mm&m=124775468226290&w=4]

V2:  remove dependency on kobject private bitfield.  Search
     global hstates then all per node hstates for kobject
     match in attribute show/store functions.

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_huge_pages    - r/o
	surplus_huge_pages - r/o

The patch attempts to re-use/share as much of the existing
global hstate attribute initialization and handling as possible.
Throughout, a node id < 0 indicates global hstate parameters.

Note:  computation of "min_count" in set_max_huge_pages() for a
specified node needs careful review. 

Issue:  dependency of the base [node] driver on the hugetlbfs module.
We want to keep all of the hstate attribute registration and handling
in the hugetlb module.  However, we need to call into this code to
register the per node hstate attributes on node hot plug.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
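
For reference, the per node listings above can be read back directly
from each node's meminfo and from the new attributes, e.g.:

(me):grep HugePages /sys/devices/system/node/node*/meminfo
(me):cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/free_hugepages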





Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c     |    2 
 include/linux/hugetlb.h |    6 +
 include/linux/node.h    |    3 
 mm/hugetlb.c            |  274 ++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 243 insertions(+), 42 deletions(-)

Index: linux-2.6.31-rc4-mmotm-090730-0501/drivers/base/node.c
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/drivers/base/node.c	2009-07-30 16:57:34.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/drivers/base/node.c	2009-07-31 11:01:49.000000000 -0400
@@ -200,6 +200,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +221,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-rc4-mmotm-090730-0501/include/linux/hugetlb.h
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/include/linux/hugetlb.h	2009-07-30 16:57:33.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/include/linux/hugetlb.h	2009-07-31 10:17:37.000000000 -0400
@@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
 	return size_to_hstate(PAGE_SIZE << compound_order(page));
 }
 
+struct node;
+extern void hugetlb_register_node(struct node *);
+extern void hugetlb_unregister_node(struct node *);
+
 #else
 struct hstate {};
 #define alloc_bootmem_huge_page(h) NULL
@@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
 {
 	return 1;
 }
+#define hugetlb_register_node(NP)
+#define hugetlb_unregister_node(NP)
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
Index: linux-2.6.31-rc4-mmotm-090730-0501/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/mm/hugetlb.c	2009-07-31 10:15:39.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/mm/hugetlb.c	2009-07-31 10:45:00.000000000 -0400
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/node.h>
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1155,14 +1156,22 @@ static void __init report_hugepages(void
 }
 
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(struct hstate *h, unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count, int nid)
 {
-	int i;
+	int i, start_i, max_i;
 
 	if (h->order >= MAX_ORDER)
 		return;
 
-	for (i = 0; i < MAX_NUMNODES; ++i) {
+	if (nid < 0) {
+		start_i = 0;
+		max_i = MAX_NUMNODES;
+	} else {
+		start_i = nid;
+		max_i = nid + 1;
+	}
+
+	for (i = start_i; i < max_i; ++i) {
 		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(page, next, freel, lru) {
@@ -1178,7 +1187,8 @@ static void try_to_free_low(struct hstat
 	}
 }
 #else
-static inline void try_to_free_low(struct hstate *h, unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count,
+								int nid)
 {
 }
 #endif
@@ -1239,8 +1249,17 @@ static int adjust_pool_surplus(struct hs
 	return ret;
 }
 
-#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long persistent_huge_pages(struct hstate *h, int nid)
+{
+	if (nid < 0)
+		return h->nr_huge_pages - h->surplus_huge_pages;
+	else
+		return h->nr_huge_pages_node[nid] -
+			h->surplus_huge_pages_node[nid];
+}
+
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+					int nid)
 {
 	unsigned long min_count, ret;
 
@@ -1259,19 +1278,26 @@ static unsigned long set_max_huge_pages(
 	 * within all the constraints specified by the sysctls.
 	 */
 	spin_lock(&hugetlb_lock);
-	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, -1))
+	while (h->surplus_huge_pages && count > persistent_huge_pages(h, nid)) {
+		if (nid < 0)
+			ret = adjust_pool_surplus(h, -1);
+		else
+			ret = adjust_pool_surplus_node(h, -1, nid);
+		if (!ret)
 			break;
 	}
 
-	while (count > persistent_huge_pages(h)) {
+	while (count > persistent_huge_pages(h, nid)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h);
+		if (nid < 0)
+			ret = alloc_fresh_huge_page(h);
+		else
+			ret = alloc_fresh_huge_page_node(h, nid);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1293,19 +1319,51 @@ static unsigned long set_max_huge_pages(
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
 	 */
-	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
+	if (nid < 0) {
+		/*
+		 * global min_count = reserve + in-use
+		 */
+		min_count = h->resv_huge_pages +
+				 h->nr_huge_pages - h->free_huge_pages;
+	} else {
+		/*
+		 * per node min_count = "min share of global reserve" +
+		 *     in-use
+		 */
+		long need_reserve = (long)h->resv_huge_pages -
+		         (h->free_huge_pages - h->free_huge_pages_node[nid]);
+		if (need_reserve < 0)
+			need_reserve = 0;
+		min_count =
+		    h->nr_huge_pages_node[nid] - h->free_huge_pages_node[nid] +
+		    need_reserve;
+	}
 	min_count = max(count, min_count);
-	try_to_free_low(h, min_count);
-	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, 0))
+	try_to_free_low(h, min_count, nid);
+	while (min_count < persistent_huge_pages(h, nid)) {
+		if (nid < 0)
+			ret = free_pool_huge_page(h, 0);
+		else
+			ret = hstate_free_huge_page_node(h, 0, nid);
+
+		if (!ret)
 			break;
 	}
-	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, 1))
+
+	while (count < persistent_huge_pages(h, nid)) {
+		if (nid < 0)
+			ret = adjust_pool_surplus(h, 1);
+		else
+			ret = adjust_pool_surplus_node(h, 1, nid);
+		if (!ret)
 			break;
 	}
 out:
-	ret = persistent_huge_pages(h);
+
+	/*
+	 * return global persistent huge pages
+	 */
+	ret = persistent_huge_pages(h, -1);
 	spin_unlock(&hugetlb_lock);
 	return ret;
 }
@@ -1320,34 +1378,69 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
-static struct hstate *kobj_to_hstate(struct kobject *kobj)
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		int hi;
+		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
+			if (node->hstate_kobjs[hi] == kobj) {
+				if (nidp)
+					*nidp = nid;
+				return &hstates[hi];
+			}
+	}
+
+	BUG();
+	return NULL;
+}
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
 {
 	int i;
+
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
+		if (hstate_kobjs[i] == kobj) {
+			if (nidp)
+				*nidp = -1;
 			return &hstates[i];
-	BUG();
-	return NULL;
+		}
+
+	return kobj_to_node_hstate(kobj, nidp);
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	struct hstate *h;
+	unsigned long nr_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
-	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h;
+	int nid;
+	int err;
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h = kobj_to_hstate(kobj, &nid);
+	h->max_huge_pages = set_max_huge_pages(h, input, nid);
 
 	return count;
 }
@@ -1356,15 +1449,17 @@ HSTATE_ATTR(nr_hugepages);
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
 	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
@@ -1381,15 +1476,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	struct hstate *h;
+	unsigned long free_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
 static ssize_t resv_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 	return sprintf(buf, "%lu\n", h->resv_huge_pages);
 }
 HSTATE_ATTR_RO(resv_hugepages);
@@ -1397,8 +1501,17 @@ HSTATE_ATTR_RO(resv_hugepages);
 static ssize_t surplus_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	struct hstate *h;
+	unsigned long surplus_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1415,19 +1528,21 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1442,17 +1557,90 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		kobject_put(node->hstate_kobjs[h - hstates]);
+		node->hstate_kobjs[h - hstates] = NULL;
+	}
+
+	kobject_put(node->hugepages_kobj);
+	node->hugepages_kobj = NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+}
+
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	int err;
+
+	if (!hugepages_kobj)
+		return;		/* too early */
+
+	node->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!node->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
+						node->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err)
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+	}
+}
+
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid && !node->hugepages_kobj)
+			hugetlb_register_node(node);
+	}
+}
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1487,6 +1675,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1589,7 +1779,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
 
 	return 0;
 }
Index: linux-2.6.31-rc4-mmotm-090730-0501/include/linux/node.h
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/include/linux/node.h	2009-06-09 23:05:27.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/include/linux/node.h	2009-07-31 10:52:31.000000000 -0400
@@ -21,9 +21,12 @@
 
 #include <linux/sysdev.h>
 #include <linux/cpumask.h>
+#include <linux/hugetlb.h>
 
 struct node {
 	struct sys_device	sysdev;
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
 };
 
 struct memory_block;





^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-07-31 10:36       ` Mel Gorman
@ 2009-07-31 19:10         ` Lee Schermerhorn
  -1 siblings, 0 replies; 31+ messages in thread
From: Lee Schermerhorn @ 2009-07-31 19:10 UTC (permalink / raw)
  To: Mel Gorman, David Rientjes
  Cc: linux-mm, linux-numa, akpm, Greg KH, Nishanth Aravamudan, andi,
	Adam Litke, Andy Whitcroft, eric.whitney

On Fri, 2009-07-31 at 11:36 +0100, Mel Gorman wrote:
> On Thu, Jul 30, 2009 at 12:39:09PM -0700, David Rientjes wrote:
> > On Wed, Jul 29, 2009 at 11:12 AM, Lee
> > Schermerhorn<lee.schermerhorn@hp.com> wrote:
> > > PATCH/RFC 4/4 hugetlb:  register per node hugepages attributes
> > >
> > > Against: 2.6.31-rc3-mmotm-090716-1432
> > > atop the previously posted alloc_bootmem_hugepages fix.
> > > [http://marc.info/?l=linux-mm&m=124775468226290&w=4]
> > >
> > > This patch adds the per huge page size control/query attributes
> > > to the per node sysdevs:
> > >
> > > /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> > >        nr_hugepages       - r/w
> > >        free_huge_pages    - r/o
> > >        surplus_huge_pages - r/o
> > >
> > > The patch attempts to re-use/share as much of the existing
> > > global hstate attribute initialization and handling as possible.
> > > Throughout, a node id < 0 indicates global hstate parameters.
> > >
> > > Note:  computation of "min_count" in set_max_huge_pages() for a
> > > specified node needs careful review.
> > >
> > > Issue:  dependency of the base [node] driver on the hugetlbfs module.
> > > We want to keep all of the hstate attribute registration and handling
> > > in the hugetlb module.  However, we need to call into this code to
> > > register the per node hstate attributes on node hot plug.
> > >
> > > With this patch:
> > >
> > > (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> > > ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
> > >
> > > Starting from:
> > > Node 0 HugePages_Total:     0
> > > Node 0 HugePages_Free:      0
> > > Node 0 HugePages_Surp:      0
> > > Node 1 HugePages_Total:     0
> > > Node 1 HugePages_Free:      0
> > > Node 1 HugePages_Surp:      0
> > > Node 2 HugePages_Total:     0
> > > Node 2 HugePages_Free:      0
> > > Node 2 HugePages_Surp:      0
> > > Node 3 HugePages_Total:     0
> > > Node 3 HugePages_Free:      0
> > > Node 3 HugePages_Surp:      0
> > > vm.nr_hugepages = 0
> > >
> > > Allocate 16 persistent huge pages on node 2:
> > > (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
> > >
> > > Yields:
> > > Node 0 HugePages_Total:     0
> > > Node 0 HugePages_Free:      0
> > > Node 0 HugePages_Surp:      0
> > > Node 1 HugePages_Total:     0
> > > Node 1 HugePages_Free:      0
> > > Node 1 HugePages_Surp:      0
> > > Node 2 HugePages_Total:    16
> > > Node 2 HugePages_Free:     16
> > > Node 2 HugePages_Surp:      0
> > > Node 3 HugePages_Total:     0
> > > Node 3 HugePages_Free:      0
> > > Node 3 HugePages_Surp:      0
> > > vm.nr_hugepages = 16
> > >
> > > Global controls work as expected--reduce pool to 8 persistent huge pages:
> > > (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > >
> > > Node 0 HugePages_Total:     0
> > > Node 0 HugePages_Free:      0
> > > Node 0 HugePages_Surp:      0
> > > Node 1 HugePages_Total:     0
> > > Node 1 HugePages_Free:      0
> > > Node 1 HugePages_Surp:      0
> > > Node 2 HugePages_Total:     8
> > > Node 2 HugePages_Free:      8
> > > Node 2 HugePages_Surp:      0
> > > Node 3 HugePages_Total:     0
> > > Node 3 HugePages_Free:      0
> > > Node 3 HugePages_Surp:      0
> > >
> > >
> > >
> > >
> > >
> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > >
> > 
> > Thank you very much for doing this.
> > 
> > Google is going to need this support regardless of what finally gets
> > merged into mainline, so I'm thrilled you've implemented this version.
> > 
> 
> The fact that there is a definite use case in mind lends weight to this
> approach but I want to be 100% sure that a hugetlbfs-specific interface
> is required in this case.
> 
> > I hugely (get it? hugely :) favor this approach because it's much
> > simpler to reserve hugepages from this interface than a mempolicy
> > based approach once hugepages have already been allocated before.  For
> > cpusets users in particular, jobs typically get allocated on a subset
> > of nodes that are required for that application and they don't last
> > for the duration of the machine's uptime.  When a job exits and the
> > nodes need to be reallocated to a new cpuset, it may be a very
> > different set of mems based on the memory requirements or interleave
> > optimizations for the new job.  Allocating resources such as hugepages
> > is possible in this scenario via mempolicies, but it would require a
> > temporary mempolicy to then allocate additional hugepages from, which
> > seems like an unnecessary requirement, especially if the job scheduler
> > that is governing hugepage allocations already has a mempolicy of its
> > own.
> > 
> 
> I don't know the setup, but let's say something like the following is
> happening
> 
> 1. job scheduler creates cpuset of subset of nodes
> 2. job scheduler creates memory policy for subset of nodes
> 3. initialisation job starts, reserves huge pages. If a memory policy is
>    already in place, it will reserve them in the correct places
> 4. Job completes
> 5. job scheduler frees the pages reserved for the job freeing up pages
>    on the subset of nodes
> 
> i.e. if the job scheduler already has a memory policy of its own, or
> even some child process of that job scheduler, it should just be able to
> set nr_hugepages and have them reserved on the correct nodes.
> 
> With the per-node-attribute approach, little stops a process going
> outside of its subset of allowed nodes.
> 
> > So it's my opinion that the mempolicy based approach is very
> > appropriate for tasks that allocate hugepages itself.  Other users,
> > particularly cpusets users, however, would require preallocation of
> > hugepages prior to a job being scheduled in which case a temporary
> > mempolicy would be required for that job scheduler. 
> 
> And why is it such a big difficulty for the job scheduler just to create a
> child that does
> 
> numactl -m $NODES_SUBSET hugeadm --pool-pages-min 2M:+$PAGES_NEEDED
> 
> ?
> 
> > That seems like
> > an inconvenience when the entire state of the system's hugepages could
> > easily be governed with the per-node hstate attributes and a slightly
> > modified user library.
> > 
> 

It occurred to me that I could easily rebase the per node attributes
atop the mempolicy based nodes_allowed series and support both methods
of control.  I'm not saying that I should, just that I could.  So, I
did.  This applies atop the V3 mempolicy based nodes_allowed series.
[You'll want the patch to restore the kfree of nodes_allowed that I just
sent out.]

I'll be offline for most/all of next week, so perhaps David and Mel
could have a look at this and sort out how they want to proceed.
Andrew is holding off on these series until we can decide on one.

Lee
---

PATCH/RFC 5/4 hugetlb:  register per node hugepages attributes

Against: 2.6.31-rc4-mmotm-090730-0510
and the hugetlb rework and mempolicy-based nodes_allowed
series

V2:  remove dependency on kobject private bitfield.  Search
     global hstates then all per node hstates for kobject
     match in attribute show/store functions.

V3:  rebase atop the mempolicy-based hugepage alloc/free;
     use custom "nodes_allowed" to restrict alloc/free to
     a specific node via per node attributes.  Per node
     attribute overrides mempolicy.  I.e., mempolicy only
     applies to global attributes.

To demonstrate feasibility--if not advisability--of supporting
both mempolicy-based persistent huge page management and per
node "override" attributes.

This patch adds the per huge page size control/query attributes
to the per node sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_huge_pages    - r/o
	surplus_huge_pages - r/o

The patch attempts to re-use/share as much as possible of the
existing global hstate attribute initialization and handling,
and of the "nodes_allowed" constraint processing.
In set_max_huge_pages(), a node id < 0 indicates a change to
global hstate parameters.  In this case, any non-default task
mempolicy will be used to generate the nodes_allowed mask.  A
node id >= 0 indicates a node-specific update, and the count
argument specifies the target count for the node.  From this
info, we compute the target global count for the hstate and
construct a nodes_allowed node mask containing only the specified
node.  Thus, setting the node specific nr_hugepages via the
per node attribute effectively overrides any task mempolicy.
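
A worked example (hypothetical numbers, not taken from the test
output below): suppose node 2 currently holds 5 persistent huge
pages out of a global pool of 20.  Writing 16 to node 2's
nr_hugepages computes

	count = 16 + (20 - 5) = 31

as the new global target, with nodes_allowed containing only
node 2, so the 11 additional pages can be allocated only on
node 2.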


Issue:  dependency of the base [node] driver on the hugetlbfs module.
We want to keep all of the hstate attribute registration and handling
in the hugetlb module.  However, we need to call into this code to
register the per node hstate attributes on node hot plug.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
	numactl -m 2 hugeadm --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c     |    2 
 include/linux/hugetlb.h |    6 +
 include/linux/node.h    |    3 
 mm/hugetlb.c            |  213 +++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 197 insertions(+), 27 deletions(-)

Index: linux-2.6.31-rc4-mmotm-090730-0501/drivers/base/node.c
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/drivers/base/node.c	2009-07-31 12:06:28.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/drivers/base/node.c	2009-07-31 12:39:23.000000000 -0400
@@ -200,6 +200,7 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_distance);
 
 		scan_unevictable_register_node(node);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -220,6 +221,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
 	scan_unevictable_unregister_node(node);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
Index: linux-2.6.31-rc4-mmotm-090730-0501/include/linux/hugetlb.h
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/include/linux/hugetlb.h	2009-07-31 12:06:28.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/include/linux/hugetlb.h	2009-07-31 12:39:23.000000000 -0400
@@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
 	return size_to_hstate(PAGE_SIZE << compound_order(page));
 }
 
+struct node;
+extern void hugetlb_register_node(struct node *);
+extern void hugetlb_unregister_node(struct node *);
+
 #else
 struct hstate {};
 #define alloc_bootmem_huge_page(h) NULL
@@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
 {
 	return 1;
 }
+#define hugetlb_register_node(NP)
+#define hugetlb_unregister_node(NP)
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
Index: linux-2.6.31-rc4-mmotm-090730-0501/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/mm/hugetlb.c	2009-07-31 12:07:16.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/mm/hugetlb.c	2009-07-31 12:59:57.000000000 -0400
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include <linux/hugetlb.h>
+#include <linux/node.h>
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs
 	return ret;
 }
 
+static nodemask_t *nodes_allowed_from_node(int nid)
+{
+	nodemask_t *nodes_allowed;
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.\nFalling back to default.\n",
+			current->comm);
+	} else {
+		nodes_clear(*nodes_allowed);
+		node_set(nid, *nodes_allowed);
+	}
+	return nodes_allowed;
+}
+
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
+								int nid)
 {
 	unsigned long min_count, ret;
 	nodemask_t *nodes_allowed;
@@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
-	nodes_allowed = huge_mpol_nodes_allowed();
+	if (nid < 0)
+		nodes_allowed = huge_mpol_nodes_allowed();
+	else {
+		/*
+		 * incoming 'count' is for node 'nid' only, so
+		 * adjust count to global, but restrict alloc/free
+		 * to the specified node.
+		 */
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		nodes_allowed = nodes_allowed_from_node(nid);
+	}
 
 	/*
 	 * Increase the pool size
@@ -1338,34 +1365,69 @@ out:
 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
 
-static struct hstate *kobj_to_hstate(struct kobject *kobj)
+static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		int hi;
+		for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
+			if (node->hstate_kobjs[hi] == kobj) {
+				if (nidp)
+					*nidp = nid;
+				return &hstates[hi];
+			}
+	}
+
+	BUG();
+	return NULL;
+}
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
 {
 	int i;
+
 	for (i = 0; i < HUGE_MAX_HSTATE; i++)
-		if (hstate_kobjs[i] == kobj)
+		if (hstate_kobjs[i] == kobj) {
+			if (nidp)
+				*nidp = -1;
 			return &hstates[i];
-	BUG();
-	return NULL;
+		}
+
+	return kobj_to_node_hstate(kobj, nidp);
 }
 
 static ssize_t nr_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+	struct hstate *h;
+	unsigned long nr_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		nr_huge_pages = h->nr_huge_pages;
+	else
+		nr_huge_pages = h->nr_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", nr_huge_pages);
 }
+
 static ssize_t nr_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
-	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h;
+	int nid;
+	int err;
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
 		return 0;
 
-	h->max_huge_pages = set_max_huge_pages(h, input);
+	h = kobj_to_hstate(kobj, &nid);
+	h->max_huge_pages = set_max_huge_pages(h, input, nid);
 
 	return count;
 }
@@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
 static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
 	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
 }
+
 static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
 	int err;
 	unsigned long input;
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
@@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->free_huge_pages);
+	struct hstate *h;
+	unsigned long free_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		free_huge_pages = h->free_huge_pages;
+	else
+		free_huge_pages = h->free_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", free_huge_pages);
 }
 HSTATE_ATTR_RO(free_hugepages);
 
 static ssize_t resv_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
 	return sprintf(buf, "%lu\n", h->resv_huge_pages);
 }
 HSTATE_ATTR_RO(resv_hugepages);
@@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
 static ssize_t surplus_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
-	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+	struct hstate *h;
+	unsigned long surplus_huge_pages;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	if (nid < 0)
+		surplus_huge_pages = h->surplus_huge_pages;
+	else
+		surplus_huge_pages = h->surplus_huge_pages_node[nid];
+
+	return sprintf(buf, "%lu\n", surplus_huge_pages);
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
@@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
 	.attrs = hstate_attrs,
 };
 
-static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
+				struct kobject *parent,
+				struct kobject **hstate_kobjs,
+				struct attribute_group *hstate_attr_group)
 {
 	int retval;
+	int hi = h - hstates;
 
-	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
-							hugepages_kobj);
-	if (!hstate_kobjs[h - hstates])
+	hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
+	if (!hstate_kobjs[hi])
 		return -ENOMEM;
 
-	retval = sysfs_create_group(hstate_kobjs[h - hstates],
-							&hstate_attr_group);
+	retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
 	if (retval)
-		kobject_put(hstate_kobjs[h - hstates]);
+		kobject_put(hstate_kobjs[hi]);
 
 	return retval;
 }
@@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
 		return;
 
 	for_each_hstate(h) {
-		err = hugetlb_sysfs_add_hstate(h);
+		err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
+					 hstate_kobjs, &hstate_attr_group);
 		if (err)
 			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
 								h->name);
 	}
 }
 
+#ifdef CONFIG_NUMA
+static struct attribute *per_node_hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group per_node_hstate_attr_group = {
+	.attrs = per_node_hstate_attrs,
+};
+
+
+void hugetlb_unregister_node(struct node *node)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		kobject_put(node->hstate_kobjs[h - hstates]);
+		node->hstate_kobjs[h - hstates] = NULL;
+	}
+
+	kobject_put(node->hugepages_kobj);
+	node->hugepages_kobj = NULL;
+}
+
+static void hugetlb_unregister_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		hugetlb_unregister_node(&node_devices[nid]);
+}
+
+void hugetlb_register_node(struct node *node)
+{
+	struct hstate *h;
+	int err;
+
+	if (!hugepages_kobj)
+		return;		/* too early */
+
+	node->hugepages_kobj = kobject_create_and_add("hugepages",
+							&node->sysdev.kobj);
+	if (!node->hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
+						node->hstate_kobjs,
+						&per_node_hstate_attr_group);
+		if (err)
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
+					" for node %d\n",
+						h->name, node->sysdev.id);
+	}
+}
+
+static void hugetlb_register_all_nodes(void)
+{
+	int nid;
+
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		struct node *node = &node_devices[nid];
+		if (node->sysdev.id == nid && !node->hugepages_kobj)
+			hugetlb_register_node(node);
+	}
+}
+#endif
+
 static void __exit hugetlb_exit(void)
 {
 	struct hstate *h;
 
+	hugetlb_unregister_all_nodes();
+
 	for_each_hstate(h) {
 		kobject_put(hstate_kobjs[h - hstates]);
 	}
@@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_sysfs_init();
 
+	hugetlb_register_all_nodes();
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 
 	if (write)
-		h->max_huge_pages = set_max_huge_pages(h, tmp);
+		h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
 
 	return 0;
 }
Index: linux-2.6.31-rc4-mmotm-090730-0501/include/linux/node.h
===================================================================
--- linux-2.6.31-rc4-mmotm-090730-0501.orig/include/linux/node.h	2009-07-31 12:06:28.000000000 -0400
+++ linux-2.6.31-rc4-mmotm-090730-0501/include/linux/node.h	2009-07-31 12:39:23.000000000 -0400
@@ -21,9 +21,12 @@
 
 #include <linux/sysdev.h>
 #include <linux/cpumask.h>
+#include <linux/hugetlb.h>
 
 struct node {
 	struct sys_device	sysdev;
+	struct kobject		*hugepages_kobj;
+	struct kobject		*hstate_kobjs[HUGE_MAX_HSTATE];
 };
 
 struct memory_block;



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-07-31 10:36       ` Mel Gorman
@ 2009-07-31 19:55         ` David Rientjes
  -1 siblings, 0 replies; 31+ messages in thread
From: David Rientjes @ 2009-07-31 19:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee Schermerhorn, linux-mm, linux-numa, Andrew Morton, Greg KH,
	Nishanth Aravamudan, Andi Kleen, Adam Litke, Andy Whitcroft,
	eric.whitney

On Fri, 31 Jul 2009, Mel Gorman wrote:

> > Google is going to need this support regardless of what finally gets
> > merged into mainline, so I'm thrilled you've implemented this version.
> > 
> 
> The fact that there is a definite use case in mind lends weight to this
> approach but I want to be 100% sure that a hugetlbfs-specific interface
> is required in this case.
> 

It's not necessarily required over the mempolicy approach for allocation 
since it's quite simple to just do

	numactl --membind nodemask echo 10 >			\
		/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

on the nodemask for which you want to allocate 10 additional hugepages 
(or, if node-targeted allocations are really necessary, to use
numactl --preferred node in succession to get a balanced interleave, for 
example.)
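
As a sketch of that succession (assuming 2MB huge pages and nodes
0-2; nr_hugepages is a global target, so each step adds to the
current value while the --preferred policy biases where the new
pages land):

	for n in 0 1 2; do
		numactl --preferred=$n sh -c \
			'f=/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
			 echo $(($(cat $f) + 10)) > $f'
	done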

> I don't know the setup, but lets say something like the following is
> happening
> 
> 1. job scheduler creates cpuset of subset of nodes
> 2. job scheduler creates memory policy for subset of nodes
> 3. initialisation job starts, reserves huge pages. If a memory policy is
>    already in place, it will reserve them in the correct places

This is where per-node nr_hugepages attributes would be helpful.  It may 
not be possible for the desired number of hugepages to be evenly allocated 
on each node in the subset for MPOL_INTERLEAVE.

If the subset is {1, 2, 3}, for instance, it's possible to get hugepage 
quantities on those nodes as {10, 5, 10}.  The preferred userspace 
solution may be to either change its subset of the cpuset nodes to 
allocate 10 hugepages on another node and not use node 2, or to deallocate 
hugepages on nodes 1 and 3 so it matches node 2.

With the per-node nr_hugepages attributes, that's trivial.  With the 
mempolicy based approach, you'd need to do this (I guess):

 - to change the subset of cpuset nodes: construct a mempolicy of
   MPOL_PREFERRED on node 2, deallocate via the global nr_hugepages file,
   select (or allocate) another cpuset node, construct another mempolicy
   of MPOL_PREFERRED on that new node, allocate, check, reiterate, and

 - to deallocate on nodes 1 and 3: construct a mempolicy of MPOL_BIND on
   nodes 1 and 3, deallocate via the global nr_hugepages.

I'm not sure at the moment that mempolicies apply when freeing hugepages via 
/sys/kernel/mm/hugepages/*/nr_hugepages rather than freeing in simple 
round-robin order, so the second solution may not even work.
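
By contrast, the rebalance is two direct writes with the per-node
attributes (a sketch using the sysfs paths this patch adds and the
node numbers from the example above):

	echo 5 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
	echo 5 > /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages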

> 4. Job completes
> 5. job scheduler frees the pages reserved for the job freeing up pages
>    on the subset of nodes
> 
> i.e. if the job scheduler already has a memory policy of it's own, or
> even some child process of that job scheduler, it should just be able to
> set nr_hugepages and have them reserved on the correct nodes.
> 

Right, allocation is simple with the mempolicy based approach, but given 
the fact that hugepages are not always successfully allocated where 
userspace wants them and that freeing is more difficult, it's easier to use per-node 
controls.

> With the per-node-attribute approach, little stops a process going
> outside of it's subset of allowed nodes.
> 

If you are allowed the capability to allocate system-wide resources for 
hugepages (and you can change your own mempolicy to MPOL_DEFAULT whenever 
you want, of course), that doesn't seem like an issue.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-07-31 19:10         ` Lee Schermerhorn
  (?)
@ 2009-08-14 22:38         ` David Rientjes
  2009-08-14 23:08           ` Andrew Morton
  2009-08-15 10:08             ` Mel Gorman
  -1 siblings, 2 replies; 31+ messages in thread
From: David Rientjes @ 2009-08-14 22:38 UTC (permalink / raw)
  To: Lee Schermerhorn, Andrew Morton
  Cc: Mel Gorman, linux-mm, linux-numa, Greg KH, Nishanth Aravamudan,
	Andi Kleen, Adam Litke, Andy Whitcroft, eric.whitney

On Fri, 31 Jul 2009, Lee Schermerhorn wrote:

> PATCH/RFC 5/4 hugetlb:  register per node hugepages attributes
> 
> Against: 2.6.31-rc4-mmotm-090730-0510
> and the hugetlb rework and mempolicy-based nodes_allowed
> series
> 

Andrew, Lee, what's the status of this patchset?  I don't see it, or the 
mempolicy support version, in mmotm-2009-08-12-13-55.

I think there are use cases for both the per-node hstate attributes and 
the mempolicy-restricted hugepage allocation support, and both features can 
co-exist in the kernel.

My particular interest is in the per-node hstate attributes because they 
allow job schedulers to preallocate hugepages on nodes attached to a 
cpuset with ease and allow node-targeted hugepage freeing for balanced 
allocations, which is a prerequisite for effective interleave 
optimizations.
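
For example, a balanced preallocation across a job's cpuset nodes
becomes a simple loop (a sketch; the node list and count are
hypothetical):

	for n in 1 2 3; do
		echo 8 > /sys/devices/system/node/node$n/hugepages/hugepages-2048kB/nr_hugepages
	done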

I'd encourage the addition of the per-node hstate attributes to mmotm.  
Thanks Lee for implementing this feature.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-08-14 22:38         ` David Rientjes
@ 2009-08-14 23:08           ` Andrew Morton
  2009-08-14 23:19               ` Greg KH
  2009-08-14 23:53               ` David Rientjes
  2009-08-15 10:08             ` Mel Gorman
  1 sibling, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2009-08-14 23:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee.Schermerhorn, mel, linux-mm, linux-numa, gregkh, nacc, andi,
	agl, apw, eric.whitney

On Fri, 14 Aug 2009 15:38:43 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> Andrew, Lee, what's the status of this patchset?

All forgotten about as far as I'm concerned.  It was v1, it had "rfc"
in there and had an "Ick, no, please don't do that" from Greg.  I
assume Greg's OK with the fixed-up version.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-08-14 23:08           ` Andrew Morton
@ 2009-08-14 23:19               ` Greg KH
  2009-08-14 23:53               ` David Rientjes
  1 sibling, 0 replies; 31+ messages in thread
From: Greg KH @ 2009-08-14 23:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Lee.Schermerhorn, mel, linux-mm, linux-numa,
	nacc, andi, agl, apw, eric.whitney

On Fri, Aug 14, 2009 at 04:08:30PM -0700, Andrew Morton wrote:
> On Fri, 14 Aug 2009 15:38:43 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > Andrew, Lee, what's the status of this patchset?
> 
> All forgotten about as far as I'm concerned.  It was v1, it had "rfc"
> in there and had an "Ick, no, please don't do that" from Greg.  I
> assume Greg's OK with the fixed-up version.

If the fixed up version does not touch the kobject core in any manner,
then yes, I have no objection.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-08-14 23:08           ` Andrew Morton
@ 2009-08-14 23:53               ` David Rientjes
  2009-08-14 23:53               ` David Rientjes
  1 sibling, 0 replies; 31+ messages in thread
From: David Rientjes @ 2009-08-14 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee Schermerhorn, Mel Gorman, linux-mm, linux-numa,
	Greg Kroah-Hartman, nacc, Andi Kleen, agl, apw, eric.whitney

On Fri, 14 Aug 2009, Andrew Morton wrote:

> On Fri, 14 Aug 2009 15:38:43 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > Andrew, Lee, what's the status of this patchset?
> 
> All forgotten about as far as I'm concerned.  It was v1, it had "rfc"
> in there and had an "Ick, no, please don't do that" from Greg.  I
> assume Greg's OK with the fixed-up version.
> 
> 

I think Greg's concerns were addressed in the latest revision of the 
patchset, specifically http://marc.info/?l=linux-mm&m=124906676520398.

Maybe the more appropriate question to ask is whether Mel has any concerns
about adding the per-node hstate attributes either as a substitute for or as
a complement to the mempolicy-based allocation approach.  Mel?

Lee, do you have plans to resend the patchset including the modified kobj 
handling?
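
For anyone tracking the kobject question: one way to meet Greg's
constraint without touching the kobject core is to keep struct kobject
unmodified and back-map a per node hstate kobject to its hstate by
scanning per node kobject arrays.  A minimal sketch of that pattern
against mm/hugetlb.c internals; the node_hstate/hstate_kobjs names are
illustrative, not necessarily what the reposted series uses:

	/* one entry per node: the node's "hugepages" parent kobject plus
	 * one child kobject per huge page size (hstate) */
	struct node_hstate {
		struct kobject *hugepages_kobj;
		struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
	};
	static struct node_hstate node_hstates[MAX_NUMNODES];

	/*
	 * Map a per node hstate attribute kobject back to its hstate and
	 * node id by scanning the per node arrays -- no private state in
	 * struct kobject required.
	 */
	static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
	{
		int nid;

		for (nid = 0; nid < MAX_NUMNODES; nid++) {
			struct node_hstate *nhs = &node_hstates[nid];
			int i;

			for (i = 0; i < HUGE_MAX_HSTATE; i++) {
				if (nhs->hstate_kobjs[i] == kobj) {
					if (nidp)
						*nidp = nid;
					return &hstates[i]; /* global hstate table */
				}
			}
		}
		return NULL;
	}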

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-08-14 22:38         ` David Rientjes
@ 2009-08-15 10:08             ` Mel Gorman
  2009-08-15 10:08             ` Mel Gorman
  1 sibling, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-08-15 10:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lee Schermerhorn, Andrew Morton, linux-mm, linux-numa, Greg KH,
	Nishanth Aravamudan, Andi Kleen, Adam Litke, Andy Whitcroft,
	eric.whitney

On Fri, Aug 14, 2009 at 03:38:43PM -0700, David Rientjes wrote:
> On Fri, 31 Jul 2009, Lee Schermerhorn wrote:
> 
> > PATCH/RFC 5/4 hugetlb:  register per node hugepages attributes
> > 
> > Against: 2.6.31-rc4-mmotm-090730-0510
> > and the hugetlb rework and mempolicy-based nodes_allowed
> > series
> > 
> 
> Andrew, Lee, what's the status of this patchset?  I don't see it, or the 
> mempolicy support version, in mmotm-2009-08-12-13-55.
> 

Lee went on holidays and I dropped the ball somewhat in that I didn't review
the combined set he posted just before he left. As the two approaches are
not mutually exclusive, my expectation was that at least one more patchset
would be posted combining both approaches before merging to -mm.

> I think there are use cases for both the per-node hstate attributes and 
> the mempolicy restricted hugepage allocation support and both features can 
> co-exist in the kernel.
> 

Agreed.

> My particular interest is in the per-node hstate attributes because it 
> allows job schedulers to preallocate hugepages in nodes attached to a 
> cpuset with ease and allows node-targeted hugepage freeing for balanced 
> allocations, which is a prerequisite for effective interleave 
> optimizations.
> 
> I'd encourage the addition of the per-node hstate attributes to mmotm.  
> Thanks Lee for implementing this feature.
> 

I'd like to see at least one patchset without the RFCs attached and have
one more read-through before it's merged.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-08-14 23:53               ` David Rientjes
@ 2009-08-17  1:10                 ` Lee Schermerhorn
  0 siblings, 0 replies; 31+ messages in thread
From: Lee Schermerhorn @ 2009-08-17  1:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-numa,
	Greg Kroah-Hartman, nacc, Andi Kleen, agl, apw, eric.whitney

On Fri, 2009-08-14 at 16:53 -0700, David Rientjes wrote:
> On Fri, 14 Aug 2009, Andrew Morton wrote:
> 
> > On Fri, 14 Aug 2009 15:38:43 -0700 (PDT)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > Andrew, Lee, what's the status of this patchset?
> > 
> > All forgotten about as far as I'm concerned.  It was v1, it had "rfc"
> > in there and had an "Ick, no, please don't do that" from Greg.  I
> > assume Greg's OK with the fixed-up version.
> > 
> > 
> 
> I think Greg's concerns were addressed in the latest revision of the 
> patchset, specifically http://marc.info/?l=linux-mm&m=124906676520398.
> 
> Maybe the more appropriate question to ask is whether Mel has any concerns
> about adding the per-node hstate attributes either as a substitute for or as
> a complement to the mempolicy-based allocation approach.  Mel?
> 
> Lee, do you have plans to resend the patchset including the modified kobj 
> handling?

Yes.  I had planned to ping you and Mel, as I hadn't heard back from you
about the combined interfaces.  I think they mesh fairly well, and the
per node attributes have the, perhaps desirable, property of ignoring
any current task mempolicy.  But, I know that some folks don't like a
proliferation of ways to do something.  I'll package up the series [I
need to update the Documentation for the per node attributes] and send
it out as soon as I can get to it.  This week, I'm pretty sure.
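
For concreteness, a minimal userspace sketch of driving the per node
attribute directly (2MB pages on node 1); the sysfs path follows the
layout proposed in this series and is an assumption, not a settled ABI:

	#include <stdio.h>

	int main(void)
	{
		/* assumed per node attribute path from this series */
		const char *attr = "/sys/devices/system/node/node1/"
				   "hugepages/hugepages-2048kB/nr_hugepages";
		FILE *f = fopen(attr, "w");

		if (!f) {
			perror(attr);
			return 1;
		}
		/* request 16 persistent huge pages on node 1; takes effect
		 * regardless of the writing task's mempolicy */
		fprintf(f, "16\n");
		return fclose(f) ? 1 : 0;
	}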

Lee

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] hugetlb: add per node hstate attributes
  2009-08-17  1:10                 ` Lee Schermerhorn
@ 2009-08-17 10:07                   ` David Rientjes
  0 siblings, 0 replies; 31+ messages in thread
From: David Rientjes @ 2009-08-17 10:07 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-numa,
	Greg Kroah-Hartman, nacc, Andi Kleen, agl, apw, eric.whitney

On Sun, 16 Aug 2009, Lee Schermerhorn wrote:

> Yes.  I had planned to ping you and Mel, as I hadn't heard back from you
> about the combined interfaces.  I think they mesh fairly well, and the
> per node attributes have the, perhaps desirable, property of ignoring
> any current task mempolicy.  But, I know that some folks don't like a
> proliferation of ways to do something.

I agree as a matter of general principle, but I don't think this would be 
a good example of it.

I'm struggling to understand exactly how clean the mempolicy-based 
approach would be if an application such as a job scheduler wanted to free 
hugepages only on specific nodes.  Presumably this would require the 
application to create an MPOL_BIND mempolicy to those nodes and write to 
/proc/sys/vm/nr_hugepages, but that may break existing implementations if 
there are no hugepages allocated on the mempolicy's nodes.
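
Roughly, a sketch of what the job scheduler would have to do, assuming
(as the mempolicy-based patches propose) that the kernel derives the
huge page nodes_allowed mask from the writing task's policy; link with
-lnuma for set_mempolicy():

	#include <numaif.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned long nodemask = 1UL << 1;	/* node 1 only */
		FILE *f;

		/* bind this task to node 1 so the nr_hugepages write
		 * would be constrained to that node */
		if (set_mempolicy(MPOL_BIND, &nodemask,
				  sizeof(nodemask) * 8)) {
			perror("set_mempolicy");
			return 1;
		}
		f = fopen("/proc/sys/vm/nr_hugepages", "w");
		if (!f) {
			perror("/proc/sys/vm/nr_hugepages");
			return 1;
		}
		fprintf(f, "0\n");	/* free the bound node's pages */
		return fclose(f) ? 1 : 0;
	}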

> I'll package up the series [I
> need to update the Documentation for the per node attributes] and send
> it out as soon as I can get to it.  This week, I'm pretty sure.
> 

That's good news, thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2009-08-17 10:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-29 18:11 [PATCH 0/4] hugetlb: V1 Per Node Hugepages attributes Lee Schermerhorn
2009-07-29 18:11 ` [PATCH 1/4] hugetlb: rework hstate_next_node_* functions Lee Schermerhorn
2009-07-29 18:11 ` [PATCH 2/4] hugetlb: numafy several functions Lee Schermerhorn
2009-07-29 18:11 ` [PATCH 3/4] hugetlb: add private bit-field to kobject structure Lee Schermerhorn
2009-07-29 18:25   ` Greg KH
2009-07-31 18:59     ` Lee Schermerhorn
2009-07-29 18:12 ` [PATCH 4/4] hugetlb: add per node hstate attributes Lee Schermerhorn
2009-07-30 19:39   ` David Rientjes
2009-07-31 10:36     ` Mel Gorman
2009-07-31 19:10       ` Lee Schermerhorn
2009-08-14 22:38         ` David Rientjes
2009-08-14 23:08           ` Andrew Morton
2009-08-14 23:19             ` Greg KH
2009-08-14 23:53             ` David Rientjes
2009-08-17  1:10               ` Lee Schermerhorn
2009-08-17 10:07                 ` David Rientjes
2009-08-15 10:08           ` Mel Gorman
2009-07-31 19:55       ` David Rientjes
