* [PATCH 0/5] Huge Pages Nodes Allowed
@ 2009-06-16 13:52 Lee Schermerhorn
  2009-06-16 13:52 ` [PATCH 1/5] Free huge pages round robin to balance across nodes Lee Schermerhorn
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

Because of asymmetries in some NUMA platforms, and "interesting"
topologies emerging in the "scale up x86" world, we have need for
better control over the placement of "fresh huge pages".  A while
back Nish Aravamudan floated a series of patches to add per node
controls for allocating pages to the hugepage pool and removing
them.  Nish apparently moved on to other tasks before those patches
were accepted.  I have kept a copy of Nish's patches and have
intended to rebase, test and resubmit them.

In an [off-list] exchange with Mel Gorman, who admits to knowledge
in the huge pages area, I asked his opinion of per node controls
for huge pages and he suggested another approach:  using the mempolicy
of the task that changes nr_hugepages to constrain the fresh huge
page allocations.  I considered this approach but it seemed to me
to be a misuse of mempolicy for populating the huge pages free
pool.  Interleave policy doesn't have the same "this node" semantics
that we want, and bind policy would require constructing a custom
node mask for each node as well as addressing OOM, which we don't want
during fresh huge page allocation.  One could derive a node mask
of allowed nodes for huge pages from the mempolicy of the task
that is modifying nr_hugepages and use that for fresh huge pages
with GFP_THISNODE.  However, if we're not going to use mempolicy
directly--e.g., via alloc_page_current() or alloc_page_vma() [with
yet another on-stack pseudo-vma :(]--I thought it cleaner to
define a "nodes allowed" nodemask for populating the [persistent]
huge pages free pool.
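
For illustration, deriving the mask from the modifying task's
mempolicy might look roughly like the sketch below.  This is not
part of the series and the helper name is made up; it just shows
what deriving a "nodes allowed" mask from task policy would involve:

static void hugetlb_nodes_from_task_policy(nodemask_t *nodes_allowed)
{
        struct mempolicy *pol = current->mempolicy;     /* task NUMA policy */

        if (pol && (pol->mode == MPOL_BIND || pol->mode == MPOL_INTERLEAVE))
                *nodes_allowed = pol->v.nodes;          /* policy's node set */
        else
                *nodes_allowed = node_online_map;       /* no policy: all nodes */
}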

This patch series introduces a [per hugepage size] "sysctl",
hugepages_nodes_allowed, that specifies a nodemask to constrain
the allocation of persistent, fresh huge pages.  The nodemask
may be specified via the sysctl, a sysfs huge pages attribute, or
on the kernel boot command line.
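
For example, with this series applied, the allowed nodes for the
default huge page size could be restricted to nodes 0 and 2 by any
of the following (the sysfs path assumes a 2MB default huge page size):

	echo 0,2 > /proc/sys/vm/hugepages_nodes_allowed
	echo 0,2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nodes_allowed
	hugepages_nodes_allowed=0,2	[on the kernel boot command line]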

The series includes a patch to free hugepages from the pool in a
"round robin" fashion, interleaved across all on-line nodes to
balance the hugepage pool across nodes.  Nish had a patch to do
this, too.

Together, these changes don't provide the fine grain of control
that per node attributes would.  Specifically, there is no easy
way to reduce the persistent huge page count for a specific node.
I think the degree of control provided by these patches is the
minimal necessary and sufficient for managing the persistent
huge page pool.  However, with a bit more reorganization, we
could implement per node controls if others would find that
useful.

For more info, see the patch descriptions and the updated kernel
hugepages documentation.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/5] Free huge pages round robin to balance across nodes
  2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
@ 2009-06-16 13:52 ` Lee Schermerhorn
  2009-06-17 13:18   ` Mel Gorman
  2009-06-16 13:52 ` [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Lee Schermerhorn
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

[PATCH 1/5] Free huge pages round robin to balance across nodes

Against:  17may09 mmotm

Currently, although increasing nr_hugepages will [attempt to]
distribute the new huge pages across all nodes in the system,
reducing nr_hugepages will free or surplus all free pages
from nodes in node id order.  This patch frees huge pages
from nodes in round robin fashion in an attempt to keep
[persistent] hugepage allocations balanced across the nodes.

New function free_pool_huge_page() is modeled on and
performs roughly the inverse of alloc_fresh_huge_page().
It replaces dequeue_huge_page(), which now has no callers
and can be removed.

Helper function hstate_next_to_free_node() uses new hstate
member next_to_free_nid to distribute "frees" across all
nodes with huge pages.

I placed this patch first in the series because I think it
[or something similar] should be applied independent of the
rest of the series.  

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/hugetlb.h |    1 
 mm/hugetlb.c            |   68 +++++++++++++++++++++++++++++++++---------------
 2 files changed, 48 insertions(+), 21 deletions(-)

Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:29.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
@@ -184,6 +184,7 @@ unsigned long hugetlb_get_unmapped_area(
 /* Defines one hugetlb page size */
 struct hstate {
 	int hugetlb_next_nid;
+	int next_to_free_nid;
 	unsigned int order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:29.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
@@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
 	h->free_huge_pages_node[nid]++;
 }
 
-static struct page *dequeue_huge_page(struct hstate *h)
-{
-	int nid;
-	struct page *page = NULL;
-
-	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
-		if (!list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-			break;
-		}
-	}
-	return page;
-}
-
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
@@ -683,6 +665,52 @@ static int alloc_fresh_huge_page(struct 
 	return ret;
 }
 
+/*
+ * helper for free_pool_huge_page() - find next node
+ * from which to free a huge page
+ */
+static int hstate_next_to_free_node(struct hstate *h)
+{
+	int next_nid;
+	next_nid = next_node(h->next_to_free_nid, node_online_map);
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(node_online_map);
+	h->next_to_free_nid = next_nid;
+	return next_nid;
+}
+
+/*
+ * Free huge page from pool from next node to free.
+ * Attempt to keep persistent huge pages more or less
+ * balanced over allowed nodes.
+ * Called with hugetlb_lock locked.
+ */
+static int free_pool_huge_page(struct hstate *h)
+{
+	int start_nid;
+	int nid;
+	int ret = 0;
+
+	start_nid = h->next_to_free_nid;
+	nid = h->next_to_free_nid;
+
+	do {
+		if (!list_empty(&h->hugepage_freelists[nid])) {
+			struct page *page =
+				list_entry(h->hugepage_freelists[nid].next,
+					  struct page, lru);
+			list_del(&page->lru);
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
+			update_and_free_page(h, page);
+			ret = 1;
+		}
+		nid = hstate_next_to_free_node(h);
+	} while (!ret && nid != start_nid);
+
+	return ret;
+}
+
 static struct page *alloc_buddy_huge_page(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long address)
 {
@@ -1226,10 +1254,8 @@ static unsigned long set_max_huge_pages(
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count);
 	while (min_count < persistent_huge_pages(h)) {
-		struct page *page = dequeue_huge_page(h);
-		if (!page)
+		if (!free_pool_huge_page(h))
 			break;
-		update_and_free_page(h, page);
 	}
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, 1))


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct
  2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
  2009-06-16 13:52 ` [PATCH 1/5] Free huge pages round robin to balance across nodes Lee Schermerhorn
@ 2009-06-16 13:52 ` Lee Schermerhorn
  2009-06-17 13:35   ` Mel Gorman
  2009-06-16 13:53 ` [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Lee Schermerhorn
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 13:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

[PATCH 2/5] add nodes_allowed members to hugepages hstate struct

Against:  17may09 mmotm

This patch adds a nodes_allowed nodemask_t pointer and a
__nodes_allowed nodemask_t to the hstate struct for constraining
fresh hugepage allocations.  It then adds sysfs attributes and
boot command line options to set and [for sysfs attributes] query
the allowed nodes mask.

A separate patch will hook up this nodes_allowed mask/pointer to
fresh huge page allocation and promoting of surplus pages to
persistent.  Another patch will add a 'sysctl' hugepages_nodes_allowed
to /proc/sys/vm.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/hugetlb.h |    2 +
 mm/hugetlb.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
@@ -193,6 +193,8 @@ struct hstate {
 	unsigned long resv_huge_pages;
 	unsigned long surplus_huge_pages;
 	unsigned long nr_overcommit_huge_pages;
+	nodemask_t *nodes_allowed;
+	nodemask_t __nodes_allowed;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
@@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages);
 static struct hstate * __initdata parsed_hstate;
 static unsigned long __initdata default_hstate_max_huge_pages;
 static unsigned long __initdata default_hstate_size;
+static struct hstate __initdata default_boot_hstate = {
+	.nodes_allowed = &node_online_map,
+};
 
 #define for_each_hstate(h) \
 	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
@@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates(
 	struct hstate *h;
 
 	for_each_hstate(h) {
+		if (!h->nodes_allowed)
+			h->nodes_allowed = &node_online_map;
+
 		/* oversize hugepages were init'ed in early boot */
 		if (h->order < MAX_ORDER)
 			hugetlb_hstate_alloc_pages(h);
@@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s
 }
 HSTATE_ATTR(nr_overcommit_hugepages);
 
+static ssize_t nodes_allowed_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	int len = 3;
+
+	if (h->nodes_allowed == &node_online_map)
+		strcpy(buf, "all");
+	else
+		len = nodelist_scnprintf(buf, PAGE_SIZE,
+					*h->nodes_allowed);
+
+	if (len)
+		buf[len++] = '\n';
+
+	return len;
+}
+
+static int set_hstate_nodes_allowed(struct hstate *h, const char *buf,
+					bool lock)
+{
+	nodemask_t nodes_allowed;
+	int ret = 1;
+	bool all = !strncasecmp(buf, "all", 3);
+
+	if (!all)
+		ret = !nodelist_parse(buf, nodes_allowed);
+	if (ret) {
+		if (lock)
+			spin_lock(&hugetlb_lock);
+
+		if (all) {
+			h->nodes_allowed = &node_online_map;
+		} else {
+			h->__nodes_allowed = nodes_allowed;
+			h->nodes_allowed = &h->__nodes_allowed;
+		}
+
+		if (lock)
+			spin_unlock(&hugetlb_lock);
+	}
+	return ret;
+}
+
+static ssize_t nodes_allowed_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+
+	if (set_hstate_nodes_allowed(h, buf, 1))
+		return count;
+	else
+		return 0;
+}
+HSTATE_ATTR(nodes_allowed);
+
 static ssize_t free_hugepages_show(struct kobject *kobj,
 					struct kobj_attribute *attr, char *buf)
 {
@@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages);
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
+	&nodes_allowed_attr.attr,
 	&free_hugepages_attr.attr,
 	&resv_hugepages_attr.attr,
 	&surplus_hugepages_attr.attr,
@@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void)
 	if (default_hstate_max_huge_pages)
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
 
+	if (default_boot_hstate.nodes_allowed != &node_online_map) {
+		default_hstate.__nodes_allowed =
+					default_boot_hstate.__nodes_allowed;
+		default_hstate.nodes_allowed =
+					&default_hstate.__nodes_allowed;
+	}
+
 	hugetlb_init_hstates();
 
 	gather_bootmem_prealloc();
@@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned 
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
 
+	h->nodes_allowed = &node_online_map;
 	parsed_hstate = h;
 }
 
@@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup(
 }
 __setup("default_hugepagesz=", hugetlb_default_setup);
 
+static int __init hugetlb_nodes_allowed_setup(char *s)
+{
+	struct hstate *h = &default_boot_hstate;
+
+	/*
+	 * max_hstate means we've parsed a hugepagesz= parameter, so
+	 * use the [most recently] parsed_hstate.  Else use default.
+	 */
+	if (max_hstate)
+		h = parsed_hstate;
+
+	return set_hstate_nodes_allowed(h, s, 0);
+}
+__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup);
+
 static unsigned int cpuset_mems_nr(unsigned int *array)
 {
 	int node;


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
  2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
  2009-06-16 13:52 ` [PATCH 1/5] Free huge pages round robin to balance across nodes Lee Schermerhorn
  2009-06-16 13:52 ` [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Lee Schermerhorn
@ 2009-06-16 13:53 ` Lee Schermerhorn
  2009-06-17 13:39   ` Mel Gorman
  2009-06-16 13:53 ` [PATCH 4/5] Add sysctl for default hstate nodes_allowed Lee Schermerhorn
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 13:53 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

[PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation

Against:  17may09 mmotm

Select only nodes from the per hstate nodes_allowed mask when
promoting surplus pages to persistent or when allocating fresh
huge pages to the pool.

Note that alloc_buddy_huge_page() still uses task policy to allocate
surplus huge pages.  This could be changed.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
@@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag
 static int hstate_next_node(struct hstate *h)
 {
 	int next_nid;
-	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+	next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed);
 	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(node_online_map);
+		next_nid = first_node(*h->nodes_allowed);
 	h->hugetlb_next_nid = next_nid;
 	return next_nid;
 }
@@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct 
 	int ret = 0;
 
 	start_nid = h->hugetlb_next_nid;
+	/*
+	 * we may have allocated with a different nodes_allowed previously
+	 */
+	if (!node_isset(start_nid, *h->nodes_allowed))
+		start_nid = hstate_next_node(h);
 
 	do {
 		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
@@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc
 
 /*
  * Increment or decrement surplus_huge_pages.  Keep node-specific counters
- * balanced by operating on them in a round-robin fashion.
+ * balanced by operating on them in a round-robin fashion.  Use nodes_allowed
+ * mask when decreasing suplus pages as we're "promoting" them to persistent.
+ * Use node_online_map for increment surplus pages as we're demoting previously
+ * persistent huge pages.
+ * Called holding the hugetlb_lock.
  * Returns 1 if an adjustment was made.
  */
 static int adjust_pool_surplus(struct hstate *h, int delta)
 {
+	nodemask_t *nodemask = &node_online_map;
 	static int prev_nid;
 	int nid = prev_nid;
 	int ret = 0;
 
 	VM_BUG_ON(delta != -1 && delta != 1);
+	if (delta < 0)
+		nodemask = h->nodes_allowed;
+
 	do {
-		nid = next_node(nid, node_online_map);
+		nid = next_node(nid, *nodemask);
 		if (nid == MAX_NUMNODES)
-			nid = first_node(node_online_map);
+			nid = first_node(*nodemask);
 
 		/* To shrink on this node, there must be a surplus page */
 		if (delta < 0 && !h->surplus_huge_pages_node[nid])


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 4/5] Add sysctl for default hstate nodes_allowed.
  2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
                   ` (2 preceding siblings ...)
  2009-06-16 13:53 ` [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Lee Schermerhorn
@ 2009-06-16 13:53 ` Lee Schermerhorn
  2009-06-17 13:41   ` Mel Gorman
  2009-06-16 13:53 ` [PATCH 5/5] Update huge pages kernel documentation Lee Schermerhorn
  2009-06-17 13:02 ` [PATCH 0/5] Huge Pages Nodes Allowed Mel Gorman
  5 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 13:53 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

[PATCH 4/5] add sysctl for default hstate nodes_allowed.

Against:  17may09 mmotm

This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed --
to set/query the default hstate's nodes_allowed.  I don't know
that this patch is required, given that we have the per hstate
controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls
for other recent hugepages controls, like nr_overcommit_hugepages,
so I've followed that convention.

Factor the formatting of the nodes_allowed mask out of nodes_allowed_show()
for use by both that function and the hugetlb_nodes_allowed_handler().

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp>

 include/linux/hugetlb.h |    1 +
 kernel/sysctl.c         |    8 ++++++++
 mm/hugetlb.c            |   43 ++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 47 insertions(+), 5 deletions(-)

Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:35.000000000 -0400
@@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
+int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
 void unmap_hugepage_range(struct vm_area_struct *,
Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c	2009-06-04 12:59:26.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c	2009-06-04 12:59:35.000000000 -0400
@@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= (void *)&hugetlb_zero,
 		.extra2		= (void *)&hugetlb_infinity,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "hugepages_nodes_allowed",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= &hugetlb_nodes_allowed_handler,
+	},
 #endif
 	{
 		.ctl_name	= VM_LOWMEM_RESERVE_RATIO,
Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:35.000000000 -0400
@@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s
 }
 HSTATE_ATTR(nr_overcommit_hugepages);
 
-static ssize_t nodes_allowed_show(struct kobject *kobj,
-					struct kobj_attribute *attr, char *buf)
+static int format_hstate_nodes_allowed(struct hstate *h, char *buf,
+					size_t buflen)
 {
-	struct hstate *h = kobj_to_hstate(kobj);
 	int len = 3;
 
 	if (h->nodes_allowed == &node_online_map)
 		strcpy(buf, "all");
 	else
-		len = nodelist_scnprintf(buf, PAGE_SIZE,
+		len = nodelist_scnprintf(buf, buflen,
 					*h->nodes_allowed);
+	return len;
+
+}
+
+static ssize_t nodes_allowed_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	int len =  format_hstate_nodes_allowed(h, buf, PAGE_SIZE);
 
-	if (len)
+	if (len && (len +1) < PAGE_SIZE)
 		buf[len++] = '\n';
 
 	return len;
@@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct
 	return 0;
 }
 
+#define NODES_ALLOWED_MAX 64
+int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write,
+			struct file *file, void __user *buffer,
+			size_t *length, loff_t *ppos)
+{
+	struct hstate *h = &default_hstate;
+	int ret = 0;
+
+	if (write) {
+		(void)set_hstate_nodes_allowed(h, buffer, 1);
+	} else {
+		char buf[NODES_ALLOWED_MAX];
+		struct ctl_table tbl = {
+			.data = buf,
+			.maxlen = NODES_ALLOWED_MAX,
+		};
+		int len =  format_hstate_nodes_allowed(h, buf, sizeof(buf));
+
+		if (len)
+			ret = proc_dostring(&tbl, write, file, buffer,
+						 length, ppos);
+	}
+	return ret;
+}
+
 #endif /* CONFIG_SYSCTL */
 
 void hugetlb_report_meminfo(struct seq_file *m)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 5/5] Update huge pages kernel documentation
  2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
                   ` (3 preceding siblings ...)
  2009-06-16 13:53 ` [PATCH 4/5] Add sysctl for default hstate nodes_allowed Lee Schermerhorn
@ 2009-06-16 13:53 ` Lee Schermerhorn
  2009-06-18 18:49   ` David Rientjes
  2009-06-17 13:02 ` [PATCH 0/5] Huge Pages Nodes Allowed Mel Gorman
  5 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 13:53 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

PATCH 5/5 - update huge pages kernel documentation

Against:  17may09 mmotm

Add description of nodes_allowed sysctl and boot parameter and
adjust surrounding context to accommodate the description.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/hugetlbpage.txt |  105 ++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 28 deletions(-)

Index: linux-2.6.30-rc8-mmotm-090603-1633/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.30-rc8-mmotm-090603-1633.orig/Documentation/vm/hugetlbpage.txt	2009-06-04 11:05:11.000000000 -0400
+++ linux-2.6.30-rc8-mmotm-090603-1633/Documentation/vm/hugetlbpage.txt	2009-06-04 12:59:52.000000000 -0400
@@ -18,13 +18,13 @@ First the Linux kernel needs to be built
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with hugepage support should show the number of configured
-hugepages in the system by running the "cat /proc/meminfo" command.
+The kernel built with huge page support should show the number of configured
+huge pages in the system by running the "cat /proc/meminfo" command.
 
 /proc/meminfo also provides information about the total number of hugetlb
 pages configured in the kernel.  It also displays information about the
 number of free hugetlb pages at any time.  It also displays information about
-the configured hugepage size - this is needed for generating the proper
+the configured huge page size - this is needed for generating the proper
 alignment and size of the arguments to the above system calls.
 
 The output of "cat /proc/meminfo" will have lines like:
@@ -37,25 +37,23 @@ HugePages_Surp:  yyy
 Hugepagesize:    zzz kB
 
 where:
-HugePages_Total is the size of the pool of hugepages.
-HugePages_Free is the number of hugepages in the pool that are not yet
-allocated.
-HugePages_Rsvd is short for "reserved," and is the number of hugepages
-for which a commitment to allocate from the pool has been made, but no
-allocation has yet been made. It's vaguely analogous to overcommit.
-HugePages_Surp is short for "surplus," and is the number of hugepages in
-the pool above the value in /proc/sys/vm/nr_hugepages. The maximum
-number of surplus hugepages is controlled by
-/proc/sys/vm/nr_overcommit_hugepages.
+HugePages_Total is the size of the pool of huge pages.
+HugePages_Free  is the number of huge pages in the pool that are not yet
+                allocated.
+HugePages_Rsvd  is short for "reserved," and is the number of huge pages for
+                which a commitment to allocate from the pool has been made,
+                but no allocation has yet been made. It's vaguely analogous
+                to overcommit.
+HugePages_Surp  is short for "surplus," and is the number of huge pages in
+                the pool above the value in /proc/sys/vm/nr_hugepages. The
+                maximum number of surplus huge pages is controlled by
+                /proc/sys/vm/nr_overcommit_hugepages.
 
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) hugepages.
 The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of hugepages is
+enough physically contiguous free pages in system (freeing of huge pages is
 possible only if there are enough hugetlb pages free that can be transferred
 back to regular memory pool).
 
@@ -67,26 +65,76 @@ use either the mmap system call or share
 the huge pages.  It is required that the system administrator preallocate
 enough memory for huge page purposes.
 
-Use the following command to dynamically allocate/deallocate hugepages:
+The administrator can preallocate huge pages on the kernel boot command line by
+specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
+requested.  This is the most reliable method for preallocating huge pages as
+memory has not yet become fragmented.
+
+Some platforms support multiple huge page sizes.  To preallocate huge pages
+of a specific size, one must preceed the huge pages boot command parameters
+with a huge page size selection parameter "hugepagesz=<size>".  <size> must
+be specified in bytes with optional scale suffix [kKmMgG].  The default huge
+page size may be selected with the "default_hugepagesz=<size>" boot parameter.
+
+/proc/sys/vm/nr_hugepages indicates the current number of configured [default
+size] hugetlb pages in the kernel.  Super user can dynamically request more
+(or free some pre-configured) hugepages.
+
+Use the following command to dynamically allocate/deallocate default sized
+hugepages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 hugepages in the system.  The success
-or failure of allocation depends on the amount of physically contiguous
-memory that is preset in system at this time.  System administrators may want
-to put this command in one of the local rc init files.  This will enable the
-kernel to request huge pages early in the boot process (when the possibility
-of getting physical contiguous pages is still very high). In either
-case, administrators will want to verify the number of hugepages actually
-allocated by checking the sysctl or meminfo.
+This command will try to configure 20 default sized hugepages in the system.
+On a NUMA platform, the kernel will attempt to distribute the hugepage pool
+over the nodes specified by the /proc/sys/vm/hugepages_nodes_allowed node mask.
+hugepages_nodes_allowed defaults to all on-line nodes.
+
+To control the nodes on which huge pages are preallocated, the administrator
+may set the hugepages_nodes_allowed for the default huge page size using:
+
+	echo <nodelist> >/proc/sys/vm/hugepages_nodes_allowed
+
+where <nodelist> is a comma separated list of one or more node ranges.  For
+example, "1,3-5" specifies nodes 1, 3, 4 and 5.  Specify "all" to request
+huge page preallocation on all on-line nodes.  The hugepages_nodes_allowed
+parameter may be specified on the kernel boot command line.
+
+The success or failure of allocation depends on the amount of physically
+contiguous memory that is preset in system at this time.  If the kernel is
+unable to allocate hugepages from some nodes in a NUMA system, it will
+attempt to make up the difference by allocating extra pages on other nodes
+with available contiguous memory, if any, within the constraints of the
+allowed nodes.
+
+System administrators may want to put this command in one of the local rc init
+files.  This will enable the kernel to request hugepages early in the boot
+process (when the possibility of getting physical contiguous pages is still
+very high). In either case, administrators will want to verify the number of
+hugepages actually allocated by checking the sysctl or meminfo.  To check the
+per node distribution of huge pages, use:
+
+	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
 
 /proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of
 hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are
 requested by applications. echo'ing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain
+indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
 hugepages from the buddy allocator, if the normal pool is exhausted. As
 these surplus hugepages go out of use, they are freed back to the buddy
-allocator.
+allocator.  Note that surplus hugepages are not constrained by the
+hugepages_nodes_allowed mask.
+
+When increasing the huge page pool size via nr_hugepages, any surplus
+pages on the allowed nodes will first be promoted to persistent huge
+pages.  Then, additional huge pages will be allocated from the allowed
+nodes, if necessary and if possible, to fulfil the new pool size.
+
+The administrator may shrink the pool of preallocated huge pages for
+the default huge page size by setting the nr_hugepages sysctl to a
+smaller value.  The kernel will attempt to free huge pages "round robin"
+across all on-line nodes, ignoring the nodes_allowed mask.  Any free huge
+pages on the selected nodes will be freed back to the buddy allocator.
 
 Caveat: Shrinking the pool via nr_hugepages such that it becomes less
 than the number of hugepages in use will convert the balance to surplus
@@ -112,6 +160,7 @@ Inside each of these directories, the sa
 
 	nr_hugepages
 	nr_overcommit_hugepages
+	nodes_allowed
 	free_hugepages
 	resv_hugepages
 	surplus_hugepages


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2009-06-16 13:53 ` [PATCH 5/5] Update huge pages kernel documentation Lee Schermerhorn
@ 2009-06-17 13:02 ` Mel Gorman
  2009-06-17 17:15   ` Lee Schermerhorn
  5 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-06-17 13:02 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote:
> Because of asymmetries in some NUMA platforms, and "interesting"
> topologies emerging in the "scale up x86" world, we have need for
> better control over the placement of "fresh huge pages".  A while
> back Nish Aravamudan floated a series of patches to add per node
> controls for allocating pages to the hugepage pool and removing
> them.  Nish apparently moved on to other tasks before those patches
> were accepted.  I have kept a copy of Nish's patches and have
> intended to rebase, test and resubmit them.
> 
> In an [off-list] exchange with Mel Gorman, who admits to knowledge
> in the huge pages area, I asked his opinion of per node controls
> for huge pages and he suggested another approach:  using the mempolicy
> of the task that changes nr_hugepages to constrain the fresh huge
> page allocations.  I considered this approach but it seemed to me
> to be a misuse of mempolicy for populating the huge pages free
> pool. 

Why would it be a misuse? Fundamentally, the huge page pools are being
filled by the current process when nr_hugepages is being used. Or are
you concerned about the specification of hugepages on the kernel command
line?

> Interleave policy doesn't have the same "this node" semantics
> that we want

By "this node" semantics, do you mean allocating from one specific node?
In that case, why would specifying a nodemask of just one node not be
sufficient?

> and bind policy would require constructing a custom
> node mask for each node as well as addressing OOM, which we don't want
> during fresh huge page allocation. 

Would the required mask not already be set up when the process set the
policy? OOM is not a major concern, it doesn't trigger for failed
hugepage allocations.

> One could derive a node mask
> of allowed nodes for huge pages from the mempolicy of the task
> that is modifying nr_hugepages and use that for fresh huge pages
> with GFP_THISNODE.  However, if we're not going to use mempolicy
> directly--e.g., via alloc_page_current() or alloc_page_vma() [with
> yet another on-stack pseudo-vma :(]--I thought it cleaner to
> define a "nodes allowed" nodemask for populating the [persistent]
> huge pages free pool.
> 

How about adding alloc_page_mempolicy() that takes the explicit mempolicy
you need?
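
Something like the following, perhaps?  Rough, untested sketch; the
helper does not exist today, the parameters are only illustrative and
the interleave case is ignored:

struct page *alloc_page_mempolicy(gfp_t gfp, unsigned int order,
					struct mempolicy *pol)
{
	nodemask_t *nmask = NULL;
	int nid = numa_node_id();

	if (pol->mode == MPOL_BIND)
		nmask = &pol->v.nodes;
	else if (pol->mode == MPOL_PREFERRED && pol->v.preferred_node >= 0)
		nid = pol->v.preferred_node;

	return __alloc_pages_nodemask(gfp, order,
					node_zonelist(nid, gfp), nmask);
}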

> This patch series introduces a [per hugepage size] "sysctl",
> hugepages_nodes_allowed, that specifies a nodemask to constrain
> the allocation of persistent, fresh huge pages.  The nodemask
> may be specified via the sysctl, a sysfs huge pages attribute, or
> on the kernel boot command line.
> 
> The series includes a patch to free hugepages from the pool in a
> "round robin" fashion, interleaved across all on-line nodes to
> balance the hugepage pool across nodes.  Nish had a patch to do
> this, too.
> 
> Together, these changes don't provide the fine grain of control
> that per node attributes would. 

I'm failing to understand at the moment why mem policies set by numactl
would not do the job for allocation at least. Freeing is a different problem.

> Specifically, there is no easy
> way to reduce the persistent huge page count for a specific node.
> I think the degree of control provided by these patches is the
> minimal necessary and sufficient for managing the persistent
> huge page pool.  However, with a bit more reorganization, we
> could implement per node controls if others would find that
> useful.
> 
> For more info, see the patch descriptions and the updated kernel
> hugepages documentation.
> 


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] Free huge pages round robin to balance across nodes
  2009-06-16 13:52 ` [PATCH 1/5] Free huge pages round robin to balance across nodes Lee Schermerhorn
@ 2009-06-17 13:18   ` Mel Gorman
  2009-06-17 17:16     ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-06-17 13:18 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Tue, Jun 16, 2009 at 09:52:36AM -0400, Lee Schermerhorn wrote:
> [PATCH 1/5] Free huge pages round robin to balance across nodes
> 
> Against:  17may09 mmotm
> 
> Currently, although increasing nr_hugepages will [attempt to]
> distribute the new huge pages across all nodes in the system,
> reducing nr_hugepages will free or surplus all free pages
> from nodes in node id order.  This patch frees huge pages
> from nodes in round robin fashion in an attempt to keep
> [persistent] hugepage allocations balanced across the nodes.
> 
> New function free_pool_huge_page() is modeled on and
> performs roughly the inverse of alloc_fresh_huge_page().
> It replaces dequeue_huge_page(), which now has no callers
> and can be removed.
> 
> Helper function hstate_next_to_free_node() uses new hstate
> member next_to_free_nid to distribute "frees" across all
> nodes with huge pages.
> 
> I placed this patch first in the series because I think it
> [or something similar] should be applied independent of the
> rest of the series.  
> 

Agreed. Reading through, I can't see any problems with the patch and it
does make the freeing symmetric with the allocation.

For clarity though, would it be worth renaming hugetlb_next_nid to
next_to_alloc_nid so that there is a clear relationship in the
round-robin allocation and freeing of pages among online nodes?

> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/hugetlb.h |    1 
>  mm/hugetlb.c            |   68 +++++++++++++++++++++++++++++++++---------------
>  2 files changed, 48 insertions(+), 21 deletions(-)
> 
> Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:29.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
> @@ -184,6 +184,7 @@ unsigned long hugetlb_get_unmapped_area(
>  /* Defines one hugetlb page size */
>  struct hstate {
>  	int hugetlb_next_nid;
> +	int next_to_free_nid;
>  	unsigned int order;
>  	unsigned long mask;
>  	unsigned long max_huge_pages;
> Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:29.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
> @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
>  	h->free_huge_pages_node[nid]++;
>  }
>  
> -static struct page *dequeue_huge_page(struct hstate *h)
> -{
> -	int nid;
> -	struct page *page = NULL;
> -
> -	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> -		if (!list_empty(&h->hugepage_freelists[nid])) {
> -			page = list_entry(h->hugepage_freelists[nid].next,
> -					  struct page, lru);
> -			list_del(&page->lru);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[nid]--;
> -			break;
> -		}
> -	}
> -	return page;
> -}
> -
>  static struct page *dequeue_huge_page_vma(struct hstate *h,
>  				struct vm_area_struct *vma,
>  				unsigned long address, int avoid_reserve)
> @@ -683,6 +665,52 @@ static int alloc_fresh_huge_page(struct 
>  	return ret;
>  }
>  
> +/*
> + * helper for free_pool_huge_page() - find next node
> + * from which to free a huge page
> + */
> +static int hstate_next_to_free_node(struct hstate *h)
> +{
> +	int next_nid;
> +	next_nid = next_node(h->next_to_free_nid, node_online_map);
> +	if (next_nid == MAX_NUMNODES)
> +		next_nid = first_node(node_online_map);
> +	h->next_to_free_nid = next_nid;
> +	return next_nid;
> +}
> +
> +/*
> + * Free huge page from pool from next node to free.
> + * Attempt to keep persistent huge pages more or less
> + * balanced over allowed nodes.
> + * Called with hugetlb_lock locked.
> + */
> +static int free_pool_huge_page(struct hstate *h)
> +{
> +	int start_nid;
> +	int nid;
> +	int ret = 0;
> +
> +	start_nid = h->next_to_free_nid;
> +	nid = h->next_to_free_nid;
> +
> +	do {
> +		if (!list_empty(&h->hugepage_freelists[nid])) {
> +			struct page *page =
> +				list_entry(h->hugepage_freelists[nid].next,
> +					  struct page, lru);
> +			list_del(&page->lru);
> +			h->free_huge_pages--;
> +			h->free_huge_pages_node[nid]--;
> +			update_and_free_page(h, page);
> +			ret = 1;
> +		}
> +		nid = hstate_next_to_free_node(h);
> +	} while (!ret && nid != start_nid);
> +
> +	return ret;
> +}
> +
>  static struct page *alloc_buddy_huge_page(struct hstate *h,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> @@ -1226,10 +1254,8 @@ static unsigned long set_max_huge_pages(
>  	min_count = max(count, min_count);
>  	try_to_free_low(h, min_count);
>  	while (min_count < persistent_huge_pages(h)) {
> -		struct page *page = dequeue_huge_page(h);
> -		if (!page)
> +		if (!free_pool_huge_page(h))
>  			break;
> -		update_and_free_page(h, page);
>  	}
>  	while (count < persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, 1))
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct
  2009-06-16 13:52 ` [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Lee Schermerhorn
@ 2009-06-17 13:35   ` Mel Gorman
  2009-06-17 17:38     ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-06-17 13:35 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Tue, Jun 16, 2009 at 09:52:53AM -0400, Lee Schermerhorn wrote:
> [PATCH 2/5] add nodes_allowed members to hugepages hstate struct
> 
> Against:  17may09 mmotm
> 
> This patch adds a nodes_allowed nodemask_t pointer and a
> __nodes_allowed nodemask_t to the hstate struct for constraining
> fresh hugepage allocations.  It then adds sysfs attributes and
> boot command line options to set and [for sysfs attributes] query
> the allowed nodes mask.
> 
> A separate patch will hook up this nodes_allowed mask/pointer to
> fresh huge page allocation and promoting of surplus pages to
> persistent.  Another patch will add a 'sysctl' hugepages_nodes_allowed
> to /proc/sys/vm.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/hugetlb.h |    2 +
>  mm/hugetlb.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 88 insertions(+)
> 
> Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
> @@ -193,6 +193,8 @@ struct hstate {
>  	unsigned long resv_huge_pages;
>  	unsigned long surplus_huge_pages;
>  	unsigned long nr_overcommit_huge_pages;
> +	nodemask_t *nodes_allowed;
> +	nodemask_t __nodes_allowed;

Can you add a comment as to why a nodemask pointer and a nodemask itself
with very similar field names are both needed?

>  	struct list_head hugepage_freelists[MAX_NUMNODES];
>  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
>  	unsigned int free_huge_pages_node[MAX_NUMNODES];
> Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
> @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages);
>  static struct hstate * __initdata parsed_hstate;
>  static unsigned long __initdata default_hstate_max_huge_pages;
>  static unsigned long __initdata default_hstate_size;
> +static struct hstate __initdata default_boot_hstate = {
> +	.nodes_allowed = &node_online_map,
> +};
>  
>  #define for_each_hstate(h) \
>  	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
> @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates(
>  	struct hstate *h;
>  
>  	for_each_hstate(h) {
> +		if (!h->nodes_allowed)
> +			h->nodes_allowed = &node_online_map;
> +

The reasoning for two nodes_allowed would appear to be to allow nodes_allowed
to be some predefined mask or a hstate-private mask. As the mask has to exist
anyway, can you not just copy it in rather than having a pointer and a mask?

It would make the check below, as to whether all nodes are allowed or not,
a bit more expensive, but it's not a big deal.
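
With an embedded mask, the "all nodes" test in nodes_allowed_show()
could become something like (assuming nodes_allowed were a plain
nodemask_t member):

	if (nodes_equal(h->nodes_allowed, node_online_map))
		strcpy(buf, "all");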

>  		/* oversize hugepages were init'ed in early boot */
>  		if (h->order < MAX_ORDER)
>  			hugetlb_hstate_alloc_pages(h);
> @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s
>  }
>  HSTATE_ATTR(nr_overcommit_hugepages);
>  
> +static ssize_t nodes_allowed_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	struct hstate *h = kobj_to_hstate(kobj);
> +	int len = 3;
> +
> +	if (h->nodes_allowed == &node_online_map)
> +		strcpy(buf, "all");
> +	else
> +		len = nodelist_scnprintf(buf, PAGE_SIZE,
> +					*h->nodes_allowed);
> +
> +	if (len)
> +		buf[len++] = '\n';
> +

buf doesn't get NULL terminated.

> +	return len;
> +}

Can print_nodes_state() be extended a little to print "all" and then
shared here?

> +
> +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf,
> +					bool lock)
> +{
> +	nodemask_t nodes_allowed;
> +	int ret = 1;
> +	bool all = !strncasecmp(buf, "all", 3);
> +
> +	if (!all)
> +		ret = !nodelist_parse(buf, nodes_allowed);
> +	if (ret) {
> +		if (lock)
> +			spin_lock(&hugetlb_lock);
> +

ick.

Convention for something like this is to have set_hstate_nodes_allowed
that takes a spinlock and __set_hstate_nodes_allowed that is the
lock-free version and call as appropriate. Can the same be done here?
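
i.e. roughly the following (untested, just reshuffling the code above):

static int __set_hstate_nodes_allowed(struct hstate *h, const char *buf)
{
	nodemask_t nodes_allowed;

	if (!strncasecmp(buf, "all", 3)) {
		h->nodes_allowed = &node_online_map;
		return 1;
	}
	if (nodelist_parse(buf, nodes_allowed))
		return 0;
	h->__nodes_allowed = nodes_allowed;
	h->nodes_allowed = &h->__nodes_allowed;
	return 1;
}

static int set_hstate_nodes_allowed(struct hstate *h, const char *buf)
{
	int ret;

	spin_lock(&hugetlb_lock);
	ret = __set_hstate_nodes_allowed(h, buf);
	spin_unlock(&hugetlb_lock);
	return ret;
}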

For that matter, is it because taking spinlocks from __setup() would
break that you avoid taking the lock in that case?

> +		if (all) {
> +			h->nodes_allowed = &node_online_map;
> +		} else {
> +			h->__nodes_allowed = nodes_allowed;
> +			h->nodes_allowed = &h->__nodes_allowed;
> +		}
> +
> +		if (lock)
> +			spin_unlock(&hugetlb_lock);
> +	}
> +	return ret;
> +}
> +
> +static ssize_t nodes_allowed_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct hstate *h = kobj_to_hstate(kobj);
> +
> +	if (set_hstate_nodes_allowed(h, buf, 1))
> +		return count;
> +	else
> +		return 0;
> +}
> +HSTATE_ATTR(nodes_allowed);
> +
>  static ssize_t free_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages);
>  static struct attribute *hstate_attrs[] = {
>  	&nr_hugepages_attr.attr,
>  	&nr_overcommit_hugepages_attr.attr,
> +	&nodes_allowed_attr.attr,
>  	&free_hugepages_attr.attr,
>  	&resv_hugepages_attr.attr,
>  	&surplus_hugepages_attr.attr,
> @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void)
>  	if (default_hstate_max_huge_pages)
>  		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
>  
> +	if (default_boot_hstate.nodes_allowed != &node_online_map) {
> +		default_hstate.__nodes_allowed =
> +					default_boot_hstate.__nodes_allowed;
> +		default_hstate.nodes_allowed =
> +					&default_hstate.__nodes_allowed;
> +	}
> +
>  	hugetlb_init_hstates();
>  
>  	gather_bootmem_prealloc();
> @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned 
>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>  					huge_page_size(h)/1024);
>  
> +	h->nodes_allowed = &node_online_map;
>  	parsed_hstate = h;
>  }
>  
> @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup(
>  }
>  __setup("default_hugepagesz=", hugetlb_default_setup);
>  
> +static int __init hugetlb_nodes_allowed_setup(char *s)
> +{
> +	struct hstate *h = &default_boot_hstate;
> +
> +	/*
> +	 * max_hstate means we've parsed a hugepagesz= parameter, so
> +	 * use the [most recently] parsed_hstate.  Else use default.
> +	 */
> +	if (max_hstate)
> +		h = parsed_hstate;
> +
> +	return set_hstate_nodes_allowed(h, s, 0);
> +}
> +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup);
> +
>  static unsigned int cpuset_mems_nr(unsigned int *array)
>  {
>  	int node;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
  2009-06-16 13:53 ` [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Lee Schermerhorn
@ 2009-06-17 13:39   ` Mel Gorman
  2009-06-17 17:47     ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-06-17 13:39 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Tue, Jun 16, 2009 at 09:53:01AM -0400, Lee Schermerhorn wrote:
> [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
> 
> Against:  17may09 mmotm
> 
> Select only nodes from the per hstate nodes_allowed mask when
> promoting surplus pages to persistent or when allocating fresh
> huge pages to the pool.
> 
> Note that alloc_buddy_huge_page() still uses task policy to allocate
> surplus huge pages.  This could be changed.
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  mm/hugetlb.c |   23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
> @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag
>  static int hstate_next_node(struct hstate *h)
>  {
>  	int next_nid;
> -	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> +	next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed);
>  	if (next_nid == MAX_NUMNODES)
> -		next_nid = first_node(node_online_map);
> +		next_nid = first_node(*h->nodes_allowed);
>  	h->hugetlb_next_nid = next_nid;
>  	return next_nid;
>  }
> @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct 
>  	int ret = 0;
>  
>  	start_nid = h->hugetlb_next_nid;
> +	/*
> +	 * we may have allocated with a different nodes_allowed previously
> +	 */
> +	if (!node_isset(start_nid, *h->nodes_allowed))
> +		start_nid = hstate_next_node(h);
>  
>  	do {
>  		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc
>  
>  /*
>   * Increment or decrement surplus_huge_pages.  Keep node-specific counters
> - * balanced by operating on them in a round-robin fashion.
> + * balanced by operating on them in a round-robin fashion.  Use nodes_allowed
> + * mask when decreasing suplus pages as we're "promoting" them to persistent.

s/suplus/surplus/

> + * Use node_online_map for increment surplus pages as we're demoting previously
> + * persistent huge pages.
> + * Called holding the hugetlb_lock.
>   * Returns 1 if an adjustment was made.
>   */
>  static int adjust_pool_surplus(struct hstate *h, int delta)
>  {
> +	nodemask_t *nodemask = &node_online_map;
>  	static int prev_nid;
>  	int nid = prev_nid;
>  	int ret = 0;
>  
>  	VM_BUG_ON(delta != -1 && delta != 1);
> +	if (delta < 0)
> +		nodemask = h->nodes_allowed;
> +

Please spell out why nodes_allowed is only used when decreasing the surplus
count.

>  	do {
> -		nid = next_node(nid, node_online_map);
> +		nid = next_node(nid, *nodemask);
>  		if (nid == MAX_NUMNODES)
> -			nid = first_node(node_online_map);
> +			nid = first_node(*nodemask);
>  
>  		/* To shrink on this node, there must be a surplus page */
>  		if (delta < 0 && !h->surplus_huge_pages_node[nid])
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/5] Add sysctl for default hstate nodes_allowed.
  2009-06-16 13:53 ` [PATCH 4/5] Add sysctl for default hstate nodes_allowed Lee Schermerhorn
@ 2009-06-17 13:41   ` Mel Gorman
  2009-06-17 17:52     ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-06-17 13:41 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Tue, Jun 16, 2009 at 09:53:08AM -0400, Lee Schermerhorn wrote:
> [PATCH 4/5] add sysctl for default hstate nodes_allowed.
> 
> Against:  17may09 mmotm
> 
> This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed --
> to set/query the default hstate's nodes_allowed.  I don't know
> that this patch is required, given that we have the per hstate
> controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls
> for other recent hugepages controls, like nr_overcommit_hugepages,
> so I've followed that convention.
> 

Yeah, it's somewhat expected that what is in /proc/sys/vm is information
on the default hugepage size.

> Factor the formatting of the nodes_allowed mask out of nodes_allowed_show()
> for use by both that function and the hugetlb_nodes_allowed_handler().
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp>
> 
>  include/linux/hugetlb.h |    1 +
>  kernel/sysctl.c         |    8 ++++++++
>  mm/hugetlb.c            |   43 ++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 47 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:35.000000000 -0400
> @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm
>  int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
>  int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
>  int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
>  void unmap_hugepage_range(struct vm_area_struct *,
> Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c	2009-06-04 12:59:26.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c	2009-06-04 12:59:35.000000000 -0400
> @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= (void *)&hugetlb_zero,
>  		.extra2		= (void *)&hugetlb_infinity,
>  	},
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "hugepages_nodes_allowed",
> +		.data		= NULL,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= &hugetlb_nodes_allowed_handler,
> +	},
>  #endif
>  	{
>  		.ctl_name	= VM_LOWMEM_RESERVE_RATIO,
> Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
> +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:35.000000000 -0400
> @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s
>  }
>  HSTATE_ATTR(nr_overcommit_hugepages);
>  
> -static ssize_t nodes_allowed_show(struct kobject *kobj,
> -					struct kobj_attribute *attr, char *buf)
> +static int format_hstate_nodes_allowed(struct hstate *h, char *buf,
> +					size_t buflen)
>  {
> -	struct hstate *h = kobj_to_hstate(kobj);
>  	int len = 3;
>  
>  	if (h->nodes_allowed == &node_online_map)
>  		strcpy(buf, "all");
>  	else
> -		len = nodelist_scnprintf(buf, PAGE_SIZE,
> +		len = nodelist_scnprintf(buf, buflen,
>  					*h->nodes_allowed);
> +	return len;
> +
> +}

This looks like unnecessary churn and could have been done in the earlier
patch introducing nodes_allowed_show()

> +
> +static ssize_t nodes_allowed_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	struct hstate *h = kobj_to_hstate(kobj);
> +	int len =  format_hstate_nodes_allowed(h, buf, PAGE_SIZE);
>  
> -	if (len)
> +	if (len && (len +1) < PAGE_SIZE)
>  		buf[len++] = '\n';
>  
>  	return len;
> @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct
>  	return 0;
>  }
>  
> +#define NODES_ALLOWED_MAX 64
> +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write,
> +			struct file *file, void __user *buffer,
> +			size_t *length, loff_t *ppos)
> +{
> +	struct hstate *h = &default_hstate;
> +	int ret = 0;
> +
> +	if (write) {
> +		(void)set_hstate_nodes_allowed(h, buffer, 1);
> +	} else {
> +		char buf[NODES_ALLOWED_MAX];
> +		struct ctl_table tbl = {
> +			.data = buf,
> +			.maxlen = NODES_ALLOWED_MAX,
> +		};
> +		int len =  format_hstate_nodes_allowed(h, buf, sizeof(buf));
> +
> +		if (len)
> +			ret = proc_dostring(&tbl, write, file, buffer,
> +						 length, ppos);
> +	}
> +	return ret;
> +}
> +
>  #endif /* CONFIG_SYSCTL */
>  
>  void hugetlb_report_meminfo(struct seq_file *m)
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-17 13:02 ` [PATCH 0/5] Huge Pages Nodes Allowed Mel Gorman
@ 2009-06-17 17:15   ` Lee Schermerhorn
  2009-06-18  9:33     ` Mel Gorman
  2009-06-18 19:08     ` David Rientjes
  0 siblings, 2 replies; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-17 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote:
> On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote:
> > Because of assymmetries in some NUMA platforms, and "interesting"
> > topologies emerging in the "scale up x86" world, we have need for
> > better control over the placement of "fresh huge pages".  A while
> > back Nish Aravamundan floated a series of patches to add per node
> > controls for allocating pages to the hugepage pool and removing
> > them.  Nish apparently moved on to other tasks before those patches
> > were accepted.  I have kept a copy of Nish's patches and have
> > intended to rebase and test them and resubmit.
> > 
> > In an [off-list] exchange with Mel Gorman, who admits to knowledge
> > in the huge pages area, I asked his opinion of per node controls
> > for huge pages and he suggested another approach:  using the mempolicy
> > of the task that changes nr_hugepages to constrain the fresh huge
> > page allocations.  I considered this approach but it seemed to me
> > to be a misuse of mempolicy for populating the huge pages free
> > pool. 
> 
> Why would it be a misuse? Fundamentally, the huge page pools are being
> filled by the current process when nr_hugepages is being used. Or are
> you concerned about the specification of hugepages on the kernel command
> line?

Well, "misuse" might be too strong [and I already softened it from
"abuse" :)]--more like "not a good fit, IMO".   I agree that it could be
made to work, but it just didn't "feel" right.  I suppose that we could
use alloc_pages_current() with the necessary gfp flags to specify the
"thisnode"/"exact_node" semantics [no fallback], making sure we didn't
OOM kill the task if a node couldn't satisfy the huge page
allocation, ...  I think it would require major restructuring of
set_max_huge_pages(), ...  Maybe that would be the right thing to do and
maybe we'll end up there, but I was looking for a simpler approach.  

And, yes, the kernel command line was a consideration.  Because of that,
I considered specifying a boot time "nodes_allowed" mask and
constructing a "nodes_allowed" mask at run time from the current task's
policy.  But, it seemed more straightforward to my twisted way of
thinking to make the nodes_allowed mask a bona fide hstate attribute.

> 
> > Interleave policy doesn't have same "this node" semantics
> > that we want
> 
> By "this node" semantics, do you mean allocating from one specific node?

Yes.

> In that case, why would specifying a nodemask of just one node not be
> sufficient?

Maybe along with GFP_THISNODE.  In general, we want to interleave the
persistent huge pages over multiple nodes, but we don't want the
allocations to succeed by falling back to another node.  Then we end up
with unbalanced allocations.  This is how fresh huge page allocation
worked before it was changed to use "THISNODE".
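
For reference, the no-fallback allocation boils down to something like the
following (sketch only; the helper name here is made up and the exact gfp
flags may differ from what is actually in the tree):

static struct page *alloc_huge_page_exact_node(struct hstate *h, int nid)
{
	/* __GFP_THISNODE: fail rather than fall back to another node */
	return alloc_pages_node(nid,
			htlb_alloc_mask | __GFP_COMP | __GFP_THISNODE |
				__GFP_REPEAT | __GFP_NOWARN,
			huge_page_order(h));
}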


> 
> > and bind policy would require constructing a custom
> > node mask for node as well as addressing OOM, which we don't want
> > during fresh huge page allocation. 
> 
> Would the required mask not already be setup when the process set the
> policy? OOM is not a major concern, it doesn't trigger for failed
> hugepage allocations.

Perhaps not with the current implementation.  However, a certain,
currently shipping, enterprise distro added a "quick and dirty"  [well,
"dirty", anyway :)] implementation of "thisnode" behavior to elimiate
the unwanted fallback behavior and resulting huge pages imbalance by
constructing an ad hoc "MPOL_BIND" policy for allocating huge pages.  We
found that this DOES result in oom kill of the task increasing
nr_hugepages if it encountered a node w/ no available huge pages.

If one used "echo" [usually a shell builtin] to increase nr_hugepages,
it killed off your shell.  sysctl was somewhat better--only killed
sysctl.  The good news was that oom kill just sets a thread info flag,
so the nr_hugepages handler continued to allocate pages around the nodes
and further oom kills were effectively no-ops.  When the allocations were
done, the victim task would die.  End users might not consider this
acceptable behavior.

I wasn't sure we'd avoid this situation if we dropped back to using task
mempolicy via, e.g., alloc_pages_current().  So, I thought I'd run this
proposal [nodes_allowed] by you.

> 
> > One could derive a node mask
> > of allowed nodes for huge pages from the mempolicy of the task
> > that is modifying nr_hugepages and use that for fresh huge pages
> > with GFP_THISNODE.  However, if we're not going to use mempolicy
> > directly--e.g., via alloc_page_current() or alloc_page_vma() [with
> > yet another on-stack pseudo-vma :(]--I thought it cleaner to
> > define a "nodes allowed" nodemask for populating the [persistent]
> > huge pages free pool.
> > 
> 
> How about adding alloc_page_mempolicy() that takes the explicit mempolicy
> you need?

Interesting you should mention this.  Somewhat off topic:  I have a
"cleanup" and reorg of shared policy and vma policy that separates
policy lookup from allocation, and adds an "alloc_page_pol()" function.
This is part of my series to generalize shared policy and extend it to
shared, mmap()ed regular files.  That aspect of the series [shared
policy on shared mmaped files] got a lot of push back without any
consideration of the technical details of the patches themselves.
Besides addressing the need/requests for this capability and providing the
cleanup [removal of all on-stack pseudo-vmas, ...], the series actually
seems to perform [slightly] better in the testing I've done.

I keep this series up to date and hope to repost again sometime with
benchmark results. 


> 
> > This patch series introduces a [per hugepage size] "sysctl",
> > hugepages_nodes_allowed, that specifies a nodemask to constrain
> > the allocation of persistent, fresh huge pages.   The nodemask
> > may be specified by a sysctl, a sysfs huge pages attribute and
> > on the kernel boot command line.  
> > 
> > The series includes a patch to free hugepages from the pool in a
> > "round robin" fashion, interleaved across all on-line nodes to
> > balance the hugepage pool across nodes.  Nish had a patch to do
> > this, too.
> > 
> > Together, these changes don't provide the fine grain of control
> > that per node attributes would. 
> 
> I'm failing to understand at the moment why mem policies set by numactl
> would not do the job for allocation at least. Freeing is a different problem.

They could be made to work.  I actually started coding up a patch to
extract a "nodes allowed" mask from the policy for use with
alloc_fresh_huge_page[_node]() so that I could maintain the overall
structure and use alloc_page_exact_node() with nodes in the allowed
mask.  And, as you say, with some investigation, we may find that we can
use alloc_pages_current() with appropriate flags to achieve the exact
node semantics on each node in the policy.
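
A rough sketch of that extraction (hypothetical helper name; mempolicy
reference counting and locking omitted for brevity):

static void hugetlb_derive_nodes_allowed(nodemask_t *nodes_allowed)
{
	struct mempolicy *pol = current->mempolicy;

	/* default policy: allow all online nodes */
	*nodes_allowed = node_online_map;

	if (!pol)
		return;

	switch (pol->mode) {
	case MPOL_BIND:
	case MPOL_INTERLEAVE:
		*nodes_allowed = pol->v.nodes;
		break;
	case MPOL_PREFERRED:
		if (pol->v.preferred_node >= 0)
			*nodes_allowed = nodemask_of_node(pol->v.preferred_node);
		break;
	default:
		break;
	}
}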

> 
> > Specifically, there is no easy
> > way to reduce the persistent huge page count for a specific node.
> > I think the degree of control provided by these patches is the
> > minimal necessary and sufficient for managing the persistent the
> > huge page pool.  However, with a bit more reorganization,  we
> > could implement per node controls if others would find that
> > useful.
> > 
> > For more info, see the patch descriptions and the updated kernel
> > hugepages documentation.
> > 
> 

More in response to your comments on the individual patches.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] Free huge pages round robin to balance across nodes
  2009-06-17 13:18   ` Mel Gorman
@ 2009-06-17 17:16     ` Lee Schermerhorn
  2009-06-18 19:08       ` David Rientjes
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-17 17:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, 2009-06-17 at 14:18 +0100, Mel Gorman wrote:
> On Tue, Jun 16, 2009 at 09:52:36AM -0400, Lee Schermerhorn wrote:
> > [PATCH 1/5] Free huge pages round robin to balance across nodes
> > 
> > Against:  17may09 mmotm
> > 
> > Currently, altho' increasing nr_hugepages will [attempt to]
> > distribute the new huge pages across all nodes in the system,
> > reducing nr_hugepages will free or surplus all free pages
> > from nodes in node id order.  This patch frees huges pages
> > from nodes in round robin fashion in an attempt to keep
> > [persistent] hugepage allocates balanced across the nodes.
> > 
> > New function free_pool_huge_page() is modeled on and
> > performs roughly the inverse of alloc_fresh_huge_page().
> > Replaces dequeue_huge_page() which now has no callers
> > and can be removed.
> > 
> > Helper function hstate_next_to_free_node() uses new hstate
> > member next_to_free_nid to distribute "frees" across all
> > nodes with huge pages.
> > 
> > I placed this patch first in the series because I think it
> > [or something similar] should be applied independent of the
> > rest of the series.  
> > 
> 
> Agreed. Reading though, I can't see any problems with the patch and it
> does make the freeing symmetric with the allocation.
> 
> For clarity though, would it be worth renaming hugetlb_next_nid to
> next_to_alloc_nid so that there is a clear relationship in the
> round-robin allocation and freeing of pages among online nodes?

OK.  I'll do that in the next version of the series.

> 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/hugetlb.h |    1 
> >  mm/hugetlb.c            |   68 +++++++++++++++++++++++++++++++++---------------
> >  2 files changed, 48 insertions(+), 21 deletions(-)
> > 
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:29.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
> > @@ -184,6 +184,7 @@ unsigned long hugetlb_get_unmapped_area(
> >  /* Defines one hugetlb page size */
> >  struct hstate {
> >  	int hugetlb_next_nid;
> > +	int next_to_free_nid;
> >  	unsigned int order;
> >  	unsigned long mask;
> >  	unsigned long max_huge_pages;
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:29.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
> > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
> >  	h->free_huge_pages_node[nid]++;
> >  }
> >  
> > -static struct page *dequeue_huge_page(struct hstate *h)
> > -{
> > -	int nid;
> > -	struct page *page = NULL;
> > -
> > -	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> > -		if (!list_empty(&h->hugepage_freelists[nid])) {
> > -			page = list_entry(h->hugepage_freelists[nid].next,
> > -					  struct page, lru);
> > -			list_del(&page->lru);
> > -			h->free_huge_pages--;
> > -			h->free_huge_pages_node[nid]--;
> > -			break;
> > -		}
> > -	}
> > -	return page;
> > -}
> > -
> >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> >  				struct vm_area_struct *vma,
> >  				unsigned long address, int avoid_reserve)
> > @@ -683,6 +665,52 @@ static int alloc_fresh_huge_page(struct 
> >  	return ret;
> >  }
> >  
> > +/*
> > + * helper for free_pool_huge_page() - find next node
> > + * from which to free a huge page
> > + */
> > +static int hstate_next_to_free_node(struct hstate *h)
> > +{
> > +	int next_nid;
> > +	next_nid = next_node(h->next_to_free_nid, node_online_map);
> > +	if (next_nid == MAX_NUMNODES)
> > +		next_nid = first_node(node_online_map);
> > +	h->next_to_free_nid = next_nid;
> > +	return next_nid;
> > +}
> > +
> > +/*
> > + * Free huge page from pool from next node to free.
> > + * Attempt to keep persistent huge pages more or less
> > + * balanced over allowed nodes.
> > + * Called with hugetlb_lock locked.
> > + */
> > +static int free_pool_huge_page(struct hstate *h)
> > +{
> > +	int start_nid;
> > +	int nid;
> > +	int ret = 0;
> > +
> > +	start_nid = h->next_to_free_nid;
> > +	nid = h->next_to_free_nid;
> > +
> > +	do {
> > +		if (!list_empty(&h->hugepage_freelists[nid])) {
> > +			struct page *page =
> > +				list_entry(h->hugepage_freelists[nid].next,
> > +					  struct page, lru);
> > +			list_del(&page->lru);
> > +			h->free_huge_pages--;
> > +			h->free_huge_pages_node[nid]--;
> > +			update_and_free_page(h, page);
> > +			ret = 1;
> > +		}
> > +		nid = hstate_next_to_free_node(h);
> > +	} while (!ret && nid != start_nid);
> > +
> > +	return ret;
> > +}
> > +
> >  static struct page *alloc_buddy_huge_page(struct hstate *h,
> >  			struct vm_area_struct *vma, unsigned long address)
> >  {
> > @@ -1226,10 +1254,8 @@ static unsigned long set_max_huge_pages(
> >  	min_count = max(count, min_count);
> >  	try_to_free_low(h, min_count);
> >  	while (min_count < persistent_huge_pages(h)) {
> > -		struct page *page = dequeue_huge_page(h);
> > -		if (!page)
> > +		if (!free_pool_huge_page(h))
> >  			break;
> > -		update_and_free_page(h, page);
> >  	}
> >  	while (count < persistent_huge_pages(h)) {
> >  		if (!adjust_pool_surplus(h, 1))
> > 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct
  2009-06-17 13:35   ` Mel Gorman
@ 2009-06-17 17:38     ` Lee Schermerhorn
  2009-06-18  9:17       ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-17 17:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, 2009-06-17 at 14:35 +0100, Mel Gorman wrote:
> On Tue, Jun 16, 2009 at 09:52:53AM -0400, Lee Schermerhorn wrote:
> > [PATCH 2/5] add nodes_allowed members to hugepages hstate struct
> > 
> > Against:  17may09 mmotm
> > 
> > This patch adds a nodes_allowed nodemask_t pointer and a
> > __nodes_allowed nodemask_t to the hstate struct for constraining
> > fresh hugepage allocations.  It then adds sysfs attributes and
> > boot command line options to set and [for sysfs attributes] query
> > the allowed nodes mask.
> > 
> > A separate patch will hook up this nodes_allowed mask/pointer to
> > fresh huge page allocation and promoting of surplus pages to
> > persistent.  Another patch will add a 'sysctl' hugepages_nodes_allowed
> > to /proc/sys/vm.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/hugetlb.h |    2 +
> >  mm/hugetlb.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 88 insertions(+)
> > 
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
> > @@ -193,6 +193,8 @@ struct hstate {
> >  	unsigned long resv_huge_pages;
> >  	unsigned long surplus_huge_pages;
> >  	unsigned long nr_overcommit_huge_pages;
> > +	nodemask_t *nodes_allowed;
> > +	nodemask_t __nodes_allowed;
> 
> Can you add a comment as to why a nodemask pointer and a nodemask itself
> with very similar field names are both needed?

Hmmm, thought I did that.  Guess I just thought about doing it :(.

It's sort of tied up with why I want to use "all" rather than a literal
node mask to specify allocation [attempts] across all nodes.  The
default and current behavior is to use the node_online_mask.  My
understanding is that this tracks node hot add/remove, and I wanted to
preserve that behavior by pointing at node_online_mask directly, in the
default and nodes_allowed=all cases.


> 
> >  	struct list_head hugepage_freelists[MAX_NUMNODES];
> >  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
> >  	unsigned int free_huge_pages_node[MAX_NUMNODES];
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
> > @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages);
> >  static struct hstate * __initdata parsed_hstate;
> >  static unsigned long __initdata default_hstate_max_huge_pages;
> >  static unsigned long __initdata default_hstate_size;
> > +static struct hstate __initdata default_boot_hstate = {
> > +	.nodes_allowed = &node_online_map,
> > +};
> >  
> >  #define for_each_hstate(h) \
> >  	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
> > @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates(
> >  	struct hstate *h;
> >  
> >  	for_each_hstate(h) {
> > +		if (!h->nodes_allowed)
> > +			h->nodes_allowed = &node_online_map;
> > +
> 
> The reasoning for two nodes_allowed would appear to be to allow nodes_allowed
> to be some predefined mash or a hstate-private mask. As the mask has to exist
> anyway, can you not just copy it in rather than having a pointer and a mask?
> 
> It would make the check below as to whether all nodes allowed or not a bit
> more expensive but it's not a big deal.

but, it wouldn't track node hot add/remove.  I suppose we could add
handlers to fix up the hstates, but that seemed like unnecessary work.

My thinking was that allocation of persistent hugepages [as opposed to
overcommitted/surplus h.p.] is an administrative action and, thus, not a
fast path.  Also, I didn't envision [my myopia, perhaps] a large number
of hstates, so I didn't think the extra pointer per hstate was a big
burden.

> 
> >  		/* oversize hugepages were init'ed in early boot */
> >  		if (h->order < MAX_ORDER)
> >  			hugetlb_hstate_alloc_pages(h);
> > @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s
> >  }
> >  HSTATE_ATTR(nr_overcommit_hugepages);
> >  
> > +static ssize_t nodes_allowed_show(struct kobject *kobj,
> > +					struct kobj_attribute *attr, char *buf)
> > +{
> > +	struct hstate *h = kobj_to_hstate(kobj);
> > +	int len = 3;
> > +
> > +	if (h->nodes_allowed == &node_online_map)
> > +		strcpy(buf, "all");
> > +	else
> > +		len = nodelist_scnprintf(buf, PAGE_SIZE,
> > +					*h->nodes_allowed);
> > +
> > +	if (len)
> > +		buf[len++] = '\n';
> > +
> 
> buf doesn't get NULL terminated.

Ah, OK.  Not sure what I was using as an example here.  Maybe I thunk
that up all on my own...

> 
> > +	return len;
> > +}
> 
> Can print_nodes_state() be extended a little to print "all" and then shared
> with here?

Well, I don't know that we'd want to see "all" for the sysfs "online"
nodes attribute.  Some might not find this too informative.  Of course,
even for huge pages nodes allowed, it's not all that informative, but
IMO fits the usage.  One can always look at the sysfs online nodes
attribute to see what "all" means for nodes_allowed.  Couldn't do that
if we changed print_nodes_state().  I guess we could add a flag to
indicate whether we wanted "all" or the actual nodemask.  And, we have
to make print_nodes_state() visible.  Do you think that would be worth
while?  

> 
> > +
> > +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf,
> > +					bool lock)
> > +{
> > +	nodemask_t nodes_allowed;
> > +	int ret = 1;
> > +	bool all = !strncasecmp(buf, "all", 3);
> > +
> > +	if (!all)
> > +		ret = !nodelist_parse(buf, nodes_allowed);
> > +	if (ret) {
> > +		if (lock)
> > +			spin_lock(&hugetlb_lock);
> > +
> 
> ick.
> 
> Convention for something like this is to have set_hstate_nodes_allowed
> that takes a spinlock and __set_hstate_nodes_allowed that is the
> lock-free version and call as appropriate. Can the same be done here?

Sorry.  Yeah, I can do that. 

> 
> For that matter, does taking spinlocks from __setup() break that you
> avoid taking it in that case?

I wasn't sure, but noticed that we weren't locking in the boot path for,
e.g., [nr_]hugepages, so I avoided it here.  I suppose I can just test
whether or not it breaks and, if not, just lock.
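
Roughly, the split might look like this (sketch only; the __setup() path
could call the unlocked variant directly if taking the lock that early
turns out to be a problem):

static int __set_hstate_nodes_allowed(struct hstate *h, const char *buf)
{
	nodemask_t nodes_allowed;

	if (!strncasecmp(buf, "all", 3)) {
		h->nodes_allowed = &node_online_map;
		return 1;
	}
	if (nodelist_parse(buf, nodes_allowed))
		return 0;

	h->__nodes_allowed = nodes_allowed;
	h->nodes_allowed = &h->__nodes_allowed;
	return 1;
}

static int set_hstate_nodes_allowed(struct hstate *h, const char *buf)
{
	int ret;

	spin_lock(&hugetlb_lock);
	ret = __set_hstate_nodes_allowed(h, buf);
	spin_unlock(&hugetlb_lock);
	return ret;
}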

> 
> > +		if (all) {
> > +			h->nodes_allowed = &node_online_map;
> > +		} else {
> > +			h->__nodes_allowed = nodes_allowed;
> > +			h->nodes_allowed = &h->__nodes_allowed;
> > +		}
> > +
> > +		if (lock)
> > +			spin_unlock(&hugetlb_lock);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static ssize_t nodes_allowed_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > +	struct hstate *h = kobj_to_hstate(kobj);
> > +
> > +	if (set_hstate_nodes_allowed(h, buf, 1))
> > +		return count;
> > +	else
> > +		return 0;
> > +}
> > +HSTATE_ATTR(nodes_allowed);
> > +
> >  static ssize_t free_hugepages_show(struct kobject *kobj,
> >  					struct kobj_attribute *attr, char *buf)
> >  {
> > @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages);
> >  static struct attribute *hstate_attrs[] = {
> >  	&nr_hugepages_attr.attr,
> >  	&nr_overcommit_hugepages_attr.attr,
> > +	&nodes_allowed_attr.attr,
> >  	&free_hugepages_attr.attr,
> >  	&resv_hugepages_attr.attr,
> >  	&surplus_hugepages_attr.attr,
> > @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void)
> >  	if (default_hstate_max_huge_pages)
> >  		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> >  
> > +	if (default_boot_hstate.nodes_allowed != &node_online_map) {
> > +		default_hstate.__nodes_allowed =
> > +					default_boot_hstate.__nodes_allowed;
> > +		default_hstate.nodes_allowed =
> > +					&default_hstate.__nodes_allowed;
> > +	}
> > +
> >  	hugetlb_init_hstates();
> >  
> >  	gather_bootmem_prealloc();
> > @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned 
> >  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> >  					huge_page_size(h)/1024);
> >  
> > +	h->nodes_allowed = &node_online_map;
> >  	parsed_hstate = h;
> >  }
> >  
> > @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup(
> >  }
> >  __setup("default_hugepagesz=", hugetlb_default_setup);
> >  
> > +static int __init hugetlb_nodes_allowed_setup(char *s)
> > +{
> > +	struct hstate *h = &default_boot_hstate;
> > +
> > +	/*
> > +	 * max_hstate means we've parsed a hugepagesz= parameter, so
> > +	 * use the [most recently] parsed_hstate.  Else use default.
> > +	 */
> > +	if (max_hstate)
> > +		h = parsed_hstate;
> > +
> > +	return set_hstate_nodes_allowed(h, s, 0);
> > +}
> > +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup);
> > +
> >  static unsigned int cpuset_mems_nr(unsigned int *array)
> >  {
> >  	int node;
> > 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
  2009-06-17 13:39   ` Mel Gorman
@ 2009-06-17 17:47     ` Lee Schermerhorn
  2009-06-18  9:18       ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-17 17:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, 2009-06-17 at 14:39 +0100, Mel Gorman wrote:
> On Tue, Jun 16, 2009 at 09:53:01AM -0400, Lee Schermerhorn wrote:
> > [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
> > 
> > Against:  17may09 mmotm
> > 
> > Select only nodes from the per hstate nodes_allowed mask when
> > promoting surplus pages to persistent or when allocating fresh
> > huge pages to the pool.
> > 
> > Note that alloc_buddy_huge_page() still uses task policy to allocate
> > surplus huge pages.  This could be changed.
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  mm/hugetlb.c |   23 ++++++++++++++++++-----
> >  1 file changed, 18 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
> > @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag
> >  static int hstate_next_node(struct hstate *h)
> >  {
> >  	int next_nid;
> > -	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> > +	next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed);
> >  	if (next_nid == MAX_NUMNODES)
> > -		next_nid = first_node(node_online_map);
> > +		next_nid = first_node(*h->nodes_allowed);
> >  	h->hugetlb_next_nid = next_nid;
> >  	return next_nid;
> >  }
> > @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct 
> >  	int ret = 0;
> >  
> >  	start_nid = h->hugetlb_next_nid;
> > +	/*
> > +	 * we may have allocated with a different nodes_allowed previously
> > +	 */
> > +	if (!node_isset(start_nid, *h->nodes_allowed))
> > +		start_nid = hstate_next_node(h);
> >  
> >  	do {
> >  		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> > @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc
> >  
> >  /*
> >   * Increment or decrement surplus_huge_pages.  Keep node-specific counters
> > - * balanced by operating on them in a round-robin fashion.
> > + * balanced by operating on them in a round-robin fashion.  Use nodes_allowed
> > + * mask when decreasing suplus pages as we're "promoting" them to persistent.
> 
> s/suplus/surplus/

ACK

> 
> > + * Use node_online_map for increment surplus pages as we're demoting previously
> > + * persistent huge pages.
> > + * Called holding the hugetlb_lock.
> >   * Returns 1 if an adjustment was made.
> >   */
> >  static int adjust_pool_surplus(struct hstate *h, int delta)
> >  {
> > +	nodemask_t *nodemask = &node_online_map;
> >  	static int prev_nid;
> >  	int nid = prev_nid;
> >  	int ret = 0;
> >  
> >  	VM_BUG_ON(delta != -1 && delta != 1);
> > +	if (delta < 0)
> > +		nodemask = h->nodes_allowed;
> > +
> 
> Please spell out why nodes_allowed is only used when decreasing the surplus
> count.

I thought my addition to the comment block did that.  

My thinking:  surplus pages are, by definition, in use, so the only time
we decrease them via adjust_pool_surplus() is when we're increasing
nr_hugepages.  New "persistent" huge pages are masked by nodes_allowed,
so promoting surplus pages to persistent should also be so masked.

Conversely, since this series uses the node_online_mask for freeing
available persistent huge pages, I mention that we use it for increasing
the surplus page count.

Should I add more to the comment?
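
For example, it could be expanded along these lines (the wording is just a
sketch, not from a posted patch):

/*
 * Increment or decrement surplus_huge_pages.  Keep node-specific counters
 * balanced by operating on them in a round-robin fashion.
 * When decreasing the surplus count we are "promoting" surplus pages to
 * persistent ones, so honor nodes_allowed just as fresh huge page
 * allocation does.  When increasing the surplus count we are demoting
 * previously persistent huge pages, which may reside on any online node,
 * so walk node_online_map instead.
 * Called holding the hugetlb_lock.
 * Returns 1 if an adjustment was made.
 */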

> 
> >  	do {
> > -		nid = next_node(nid, node_online_map);
> > +		nid = next_node(nid, *nodemask);
> >  		if (nid == MAX_NUMNODES)
> > -			nid = first_node(node_online_map);
> > +			nid = first_node(*nodemask);
> >  
> >  		/* To shrink on this node, there must be a surplus page */
> >  		if (delta < 0 && !h->surplus_huge_pages_node[nid])
> > 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/5] Add sysctl for default hstate nodes_allowed.
  2009-06-17 13:41   ` Mel Gorman
@ 2009-06-17 17:52     ` Lee Schermerhorn
  2009-06-18  9:19       ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-17 17:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, 2009-06-17 at 14:41 +0100, Mel Gorman wrote:
> On Tue, Jun 16, 2009 at 09:53:08AM -0400, Lee Schermerhorn wrote:
> > [PATCH 4/5] add sysctl for default hstate nodes_allowed.
> > 
> > Against:  17may09 mmotm
> > 
> > This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed --
> > to set/query the default hstate's nodes_allowed.  I don't know
> > that this patch is required, given that we have the per hstate
> > controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls
> > for other recent hugepages controls, like nr_overcommit_hugepages,
> > so I've followed that convention.
> > 
> 
> Yeah, it's somewhat expected that what is in /proc/sys/vm is information
> on the default hugepage size.
> 
> > Factor the formatting of the nodes_allowed mask out of nodes_allowed_show()
> > for use by both that function and the hugetlb_nodes_allowed_handler().
> > 
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp>
> > 
> >  include/linux/hugetlb.h |    1 +
> >  kernel/sysctl.c         |    8 ++++++++
> >  mm/hugetlb.c            |   43 ++++++++++++++++++++++++++++++++++++++-----
> >  3 files changed, 47 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:35.000000000 -0400
> > @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm
> >  int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> >  int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> >  int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> > +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> >  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> >  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
> >  void unmap_hugepage_range(struct vm_area_struct *,
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c	2009-06-04 12:59:26.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c	2009-06-04 12:59:35.000000000 -0400
> > @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = {
> >  		.extra1		= (void *)&hugetlb_zero,
> >  		.extra2		= (void *)&hugetlb_infinity,
> >  	},
> > +	{
> > +		.ctl_name	= CTL_UNNUMBERED,
> > +		.procname	= "hugepages_nodes_allowed",
> > +		.data		= NULL,
> > +		.maxlen		= sizeof(unsigned long),
> > +		.mode		= 0644,
> > +		.proc_handler	= &hugetlb_nodes_allowed_handler,
> > +	},
> >  #endif
> >  	{
> >  		.ctl_name	= VM_LOWMEM_RESERVE_RATIO,
> > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
> > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:35.000000000 -0400
> > @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s
> >  }
> >  HSTATE_ATTR(nr_overcommit_hugepages);
> >  
> > -static ssize_t nodes_allowed_show(struct kobject *kobj,
> > -					struct kobj_attribute *attr, char *buf)
> > +static int format_hstate_nodes_allowed(struct hstate *h, char *buf,
> > +					size_t buflen)
> >  {
> > -	struct hstate *h = kobj_to_hstate(kobj);
> >  	int len = 3;
> >  
> >  	if (h->nodes_allowed == &node_online_map)
> >  		strcpy(buf, "all");
> >  	else
> > -		len = nodelist_scnprintf(buf, PAGE_SIZE,
> > +		len = nodelist_scnprintf(buf, buflen,
> >  					*h->nodes_allowed);
> > +	return len;
> > +
> > +}
> 
> This looks like unnecessary churn and could have been done in the earlier
> patch introducing nodes_allowed_show()

Yes.  I couldn't see creating the separate function if we weren't going
to have the sysctl also formatting the nodes_allowed.  If all agree that
we want the sysctl, I can fold this patch in with patch 2/5.

> 
> > +
> > +static ssize_t nodes_allowed_show(struct kobject *kobj,
> > +					struct kobj_attribute *attr, char *buf)
> > +{
> > +	struct hstate *h = kobj_to_hstate(kobj);
> > +	int len =  format_hstate_nodes_allowed(h, buf, PAGE_SIZE);
> >  
> > -	if (len)
> > +	if (len && (len +1) < PAGE_SIZE)
> >  		buf[len++] = '\n';
> >  
> >  	return len;
> > @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct
> >  	return 0;
> >  }
> >  
> > +#define NODES_ALLOWED_MAX 64
> > +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write,
> > +			struct file *file, void __user *buffer,
> > +			size_t *length, loff_t *ppos)
> > +{
> > +	struct hstate *h = &default_hstate;
> > +	int ret = 0;
> > +
> > +	if (write) {
> > +		(void)set_hstate_nodes_allowed(h, buffer, 1);
> > +	} else {
> > +		char buf[NODES_ALLOWED_MAX];
> > +		struct ctl_table tbl = {
> > +			.data = buf,
> > +			.maxlen = NODES_ALLOWED_MAX,
> > +		};
> > +		int len =  format_hstate_nodes_allowed(h, buf, sizeof(buf));
> > +
> > +		if (len)
> > +			ret = proc_dostring(&tbl, write, file, buffer,
> > +						 length, ppos);
> > +	}
> > +	return ret;
> > +}
> > +
> >  #endif /* CONFIG_SYSCTL */
> >  
> >  void hugetlb_report_meminfo(struct seq_file *m)
> > 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct
  2009-06-17 17:38     ` Lee Schermerhorn
@ 2009-06-18  9:17       ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-06-18  9:17 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, Jun 17, 2009 at 01:38:16PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 14:35 +0100, Mel Gorman wrote:
> > On Tue, Jun 16, 2009 at 09:52:53AM -0400, Lee Schermerhorn wrote:
> > > [PATCH 2/5] add nodes_allowed members to hugepages hstate struct
> > > 
> > > Against:  17may09 mmotm
> > > 
> > > This patch adds a nodes_allowed nodemask_t pointer and a
> > > __nodes_allowed nodemask_t to the hstate struct for constraining
> > > fresh hugepage allocations.  It then adds sysfs attributes and
> > > boot command line options to set and [for sysfs attributes] query
> > > the allowed nodes mask.
> > > 
> > > A separate patch will hook up this nodes_allowed mask/pointer to
> > > fresh huge page allocation and promoting of surplus pages to
> > > persistent.  Another patch will add a 'sysctl' hugepages_nodes_allowed
> > > to /proc/sys/vm.
> > > 
> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > 
> > >  include/linux/hugetlb.h |    2 +
> > >  mm/hugetlb.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 88 insertions(+)
> > > 
> > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> > > ===================================================================
> > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:31.000000000 -0400
> > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
> > > @@ -193,6 +193,8 @@ struct hstate {
> > >  	unsigned long resv_huge_pages;
> > >  	unsigned long surplus_huge_pages;
> > >  	unsigned long nr_overcommit_huge_pages;
> > > +	nodemask_t *nodes_allowed;
> > > +	nodemask_t __nodes_allowed;
> > 
> > Can you add a comment as to why a nodemask pointer and a nodemask itself
> > with very similar field names are both needed?
> 
> Hmmm, thought I did that.  Guess I just thought about doing it :(.
> 
> It's sort of tied up with why I want to use "all" rather than a literal
> node mask to specify allocation [attempts] across all nodes.  The
> default and current behavior is to use the node_online_mask.  My
> understanding is that this tracks node hot add/remove, and I wanted to
> preserve that behavior by pointing at node_online_mask directly, in the
> default and nodes_allowed=all cases.
> 

Good point, I had not considered node hot add/remove. Any chance you
could kmalloc() the nodes_allowed mask when one is set? It just feels
remarkably untidy to have two nodes_allowed like this, particularly as
the __nodes_allowed will not be used in the majority of cases.
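
Something like the following, perhaps (untested sketch; the early-boot
__setup() path would need a static mask rather than kmalloc()):

static int set_hstate_nodes_allowed(struct hstate *h, const char *buf)
{
	nodemask_t *nodes = NULL;

	if (strncasecmp(buf, "all", 3)) {
		nodes = kmalloc(sizeof(*nodes), GFP_KERNEL);
		if (!nodes || nodelist_parse(buf, *nodes)) {
			kfree(nodes);
			return 0;
		}
	}

	spin_lock(&hugetlb_lock);
	if (h->nodes_allowed != &node_online_map)
		kfree(h->nodes_allowed);
	h->nodes_allowed = nodes ? nodes : &node_online_map;
	spin_unlock(&hugetlb_lock);
	return 1;
}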

> 
> > 
> > >  	struct list_head hugepage_freelists[MAX_NUMNODES];
> > >  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
> > >  	unsigned int free_huge_pages_node[MAX_NUMNODES];
> > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:31.000000000 -0400
> > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
> > > @@ -40,6 +40,9 @@ __initdata LIST_HEAD(huge_boot_pages);
> > >  static struct hstate * __initdata parsed_hstate;
> > >  static unsigned long __initdata default_hstate_max_huge_pages;
> > >  static unsigned long __initdata default_hstate_size;
> > > +static struct hstate __initdata default_boot_hstate = {
> > > +	.nodes_allowed = &node_online_map,
> > > +};
> > >  
> > >  #define for_each_hstate(h) \
> > >  	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
> > > @@ -1102,6 +1105,9 @@ static void __init hugetlb_init_hstates(
> > >  	struct hstate *h;
> > >  
> > >  	for_each_hstate(h) {
> > > +		if (!h->nodes_allowed)
> > > +			h->nodes_allowed = &node_online_map;
> > > +
> > 
> > The reasoning for two nodes_allowed would appear to be to allow nodes_allowed
> > to be some predefined mash or a hstate-private mask. As the mask has to exist
> > anyway, can you not just copy it in rather than having a pointer and a mask?
> > 
> > It would make the check below as to whether all nodes allowed or not a bit
> > more expensive but it's not a big deal.
> 
> but, it wouldn't track node hot add/remove.  I suppose we could add
> handlers to fix up the hstates, but that seemed unnecessary work.
> 

Right.

> My thinking was that allocation of persistent hugepages [as opposed to
> overcommitted/surplus h.p.] is an administrative action and, thus, not a
> fast path.  Also, I didn't envision [my myopia, perhaps] a large number
> of hstates, so I didn't think the extra pointer per hstate was a big
> burden.
> 

It's not a big memory burden, it just seems untidy and there are a lot
of untidy anomalies in hugetlbfs as it is. While I'm far from innocent,
I hate to see it getting even odder looking :)

> > 
> > >  		/* oversize hugepages were init'ed in early boot */
> > >  		if (h->order < MAX_ORDER)
> > >  			hugetlb_hstate_alloc_pages(h);
> > > @@ -1335,6 +1341,62 @@ static ssize_t nr_overcommit_hugepages_s
> > >  }
> > >  HSTATE_ATTR(nr_overcommit_hugepages);
> > >  
> > > +static ssize_t nodes_allowed_show(struct kobject *kobj,
> > > +					struct kobj_attribute *attr, char *buf)
> > > +{
> > > +	struct hstate *h = kobj_to_hstate(kobj);
> > > +	int len = 3;
> > > +
> > > +	if (h->nodes_allowed == &node_online_map)
> > > +		strcpy(buf, "all");
> > > +	else
> > > +		len = nodelist_scnprintf(buf, PAGE_SIZE,
> > > +					*h->nodes_allowed);
> > > +
> > > +	if (len)
> > > +		buf[len++] = '\n';
> > > +
> > 
> > buf doesn't get NULL terminated.
> 
> Ah, OK.  Not sure what I was using as an example here.  Maybe I thunk
> that up all on my own...
> 
> > 
> > > +	return len;
> > > +}
> > 
> > Can print_nodes_state() be extended a little to print "all" and then shared
> > with here?
> 
> Well, I don't know that we'd want to see "all" for the sysfs "online"
> nodes attribute.  Some might not find this too informative.  Of course,
> even for huge pages nodes allowed, it's not all that informative, but
> IMO fits the usage.  One can always look at the sysfs online nodes
> attribute to see what "all" means for nodes_allowed.  Couldn't do that
> if we changed print_nodes_state().  I guess we could add a flag to
> indicate whether we wanted "all" or the actual nodemask.  And, we have
> to make print_nodes_state() visible.  Do you think that would be worth
> while?  
> 

I see your point. Maybe, maybe not. The suggestion was due to the bugs in
the new version more than anything else. I figured it would be easier to get
one version right than have multiple slightly different versions. How attached
are you to having "all" printed? There is an argument for having nodemasks
reported via sysfs all looking the same.

> > 
> > > +
> > > +static int set_hstate_nodes_allowed(struct hstate *h, const char *buf,
> > > +					bool lock)
> > > +{
> > > +	nodemask_t nodes_allowed;
> > > +	int ret = 1;
> > > +	bool all = !strncasecmp(buf, "all", 3);
> > > +
> > > +	if (!all)
> > > +		ret = !nodelist_parse(buf, nodes_allowed);
> > > +	if (ret) {
> > > +		if (lock)
> > > +			spin_lock(&hugetlb_lock);
> > > +
> > 
> > ick.
> > 
> > Convention for something like this is to have set_hstate_nodes_allowed
> > that takes a spinlock and __set_hstate_nodes_allowed that is the
> > lock-free version and call as appropriate. Can the same be done here?
> 
> Sorry.  Yeah, I can do that. 
> 
> > 
> > For that matter, does taking spinlocks from __setup() break that you
> > avoid taking it in that case?
> 
> I wasn't sure, but noticed that we weren't locking in the boot path for,
> e.g., [nr_]hugepages, so I avoided it here.  I suppose I can just test
> whether or not it breaks and, if not, just lock.
> 

A lack of locking in the boot path is more likely due to laziness than to
any problem with taking the locks.

> > 
> > > +		if (all) {
> > > +			h->nodes_allowed = &node_online_map;
> > > +		} else {
> > > +			h->__nodes_allowed = nodes_allowed;
> > > +			h->nodes_allowed = &h->__nodes_allowed;
> > > +		}
> > > +
> > > +		if (lock)
> > > +			spin_unlock(&hugetlb_lock);
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +static ssize_t nodes_allowed_store(struct kobject *kobj,
> > > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > > +{
> > > +	struct hstate *h = kobj_to_hstate(kobj);
> > > +
> > > +	if (set_hstate_nodes_allowed(h, buf, 1))
> > > +		return count;
> > > +	else
> > > +		return 0;
> > > +}
> > > +HSTATE_ATTR(nodes_allowed);
> > > +
> > >  static ssize_t free_hugepages_show(struct kobject *kobj,
> > >  					struct kobj_attribute *attr, char *buf)
> > >  {
> > > @@ -1362,6 +1424,7 @@ HSTATE_ATTR_RO(surplus_hugepages);
> > >  static struct attribute *hstate_attrs[] = {
> > >  	&nr_hugepages_attr.attr,
> > >  	&nr_overcommit_hugepages_attr.attr,
> > > +	&nodes_allowed_attr.attr,
> > >  	&free_hugepages_attr.attr,
> > >  	&resv_hugepages_attr.attr,
> > >  	&surplus_hugepages_attr.attr,
> > > @@ -1436,6 +1499,13 @@ static int __init hugetlb_init(void)
> > >  	if (default_hstate_max_huge_pages)
> > >  		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> > >  
> > > +	if (default_boot_hstate.nodes_allowed != &node_online_map) {
> > > +		default_hstate.__nodes_allowed =
> > > +					default_boot_hstate.__nodes_allowed;
> > > +		default_hstate.nodes_allowed =
> > > +					&default_hstate.__nodes_allowed;
> > > +	}
> > > +
> > >  	hugetlb_init_hstates();
> > >  
> > >  	gather_bootmem_prealloc();
> > > @@ -1471,6 +1541,7 @@ void __init hugetlb_add_hstate(unsigned 
> > >  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> > >  					huge_page_size(h)/1024);
> > >  
> > > +	h->nodes_allowed = &node_online_map;
> > >  	parsed_hstate = h;
> > >  }
> > >  
> > > @@ -1518,6 +1589,21 @@ static int __init hugetlb_default_setup(
> > >  }
> > >  __setup("default_hugepagesz=", hugetlb_default_setup);
> > >  
> > > +static int __init hugetlb_nodes_allowed_setup(char *s)
> > > +{
> > > +	struct hstate *h = &default_boot_hstate;
> > > +
> > > +	/*
> > > +	 * max_hstate means we've parsed a hugepagesz= parameter, so
> > > +	 * use the [most recently] parsed_hstate.  Else use default.
> > > +	 */
> > > +	if (max_hstate)
> > > +		h = parsed_hstate;
> > > +
> > > +	return set_hstate_nodes_allowed(h, s, 0);
> > > +}
> > > +__setup("hugepages_nodes_allowed=", hugetlb_nodes_allowed_setup);
> > > +
> > >  static unsigned int cpuset_mems_nr(unsigned int *array)
> > >  {
> > >  	int node;
> > > 
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
  2009-06-17 17:47     ` Lee Schermerhorn
@ 2009-06-18  9:18       ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-06-18  9:18 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, Jun 17, 2009 at 01:47:03PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 14:39 +0100, Mel Gorman wrote:
> > On Tue, Jun 16, 2009 at 09:53:01AM -0400, Lee Schermerhorn wrote:
> > > [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation
> > > 
> > > Against:  17may09 mmotm
> > > 
> > > Select only nodes from the per hstate nodes_allowed mask when
> > > promoting surplus pages to persistent or when allocating fresh
> > > huge pages to the pool.
> > > 
> > > Note that alloc_buddy_huge_page() still uses task policy to allocate
> > > surplus huge pages.  This could be changed.
> > > 
> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > 
> > >  mm/hugetlb.c |   23 ++++++++++++++++++-----
> > >  1 file changed, 18 insertions(+), 5 deletions(-)
> > > 
> > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:32.000000000 -0400
> > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
> > > @@ -637,9 +637,9 @@ static struct page *alloc_fresh_huge_pag
> > >  static int hstate_next_node(struct hstate *h)
> > >  {
> > >  	int next_nid;
> > > -	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> > > +	next_nid = next_node(h->hugetlb_next_nid, *h->nodes_allowed);
> > >  	if (next_nid == MAX_NUMNODES)
> > > -		next_nid = first_node(node_online_map);
> > > +		next_nid = first_node(*h->nodes_allowed);
> > >  	h->hugetlb_next_nid = next_nid;
> > >  	return next_nid;
> > >  }
> > > @@ -652,6 +652,11 @@ static int alloc_fresh_huge_page(struct 
> > >  	int ret = 0;
> > >  
> > >  	start_nid = h->hugetlb_next_nid;
> > > +	/*
> > > +	 * we may have allocated with a different nodes_allowed previously
> > > +	 */
> > > +	if (!node_isset(start_nid, *h->nodes_allowed))
> > > +		start_nid = hstate_next_node(h);
> > >  
> > >  	do {
> > >  		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> > > @@ -1169,20 +1174,28 @@ static inline void try_to_free_low(struc
> > >  
> > >  /*
> > >   * Increment or decrement surplus_huge_pages.  Keep node-specific counters
> > > - * balanced by operating on them in a round-robin fashion.
> > > + * balanced by operating on them in a round-robin fashion.  Use nodes_allowed
> > > + * mask when decreasing suplus pages as we're "promoting" them to persistent.
> > 
> > s/suplus/surplus/
> 
> ACK
> 
> > 
> > > + * Use node_online_map for increment surplus pages as we're demoting previously
> > > + * persistent huge pages.
> > > + * Called holding the hugetlb_lock.
> > >   * Returns 1 if an adjustment was made.
> > >   */
> > >  static int adjust_pool_surplus(struct hstate *h, int delta)
> > >  {
> > > +	nodemask_t *nodemask = &node_online_map;
> > >  	static int prev_nid;
> > >  	int nid = prev_nid;
> > >  	int ret = 0;
> > >  
> > >  	VM_BUG_ON(delta != -1 && delta != 1);
> > > +	if (delta < 0)
> > > +		nodemask = h->nodes_allowed;
> > > +
> > 
> > Please spell out why nodes_allowed is only used when decreasing the surplus
> > count.
> 
> I thought my addition to the comment block did that.  
> 
> My thinking:  surplus pages are, by definition, in use, so the only time
> we decrease them via adjust_pool_surplus() is when we're increasing
> nr_hugepages.  New "persistent" huge pages are masked by nodes_allowed,
> so promoting surplus pages to persistent should also be so masked.
> 

This in addition to the existing comment is more than adequate.

> Conversely, since this series uses the node_online_mask for freeing
> available persistent huge pages, I mention that we use it for increasing
> the surplus page count.
> 
> Should I add more to the comment?
> 

Do please.

> > 
> > >  	do {
> > > -		nid = next_node(nid, node_online_map);
> > > +		nid = next_node(nid, *nodemask);
> > >  		if (nid == MAX_NUMNODES)
> > > -			nid = first_node(node_online_map);
> > > +			nid = first_node(*nodemask);
> > >  
> > >  		/* To shrink on this node, there must be a surplus page */
> > >  		if (delta < 0 && !h->surplus_huge_pages_node[nid])
> > > 
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/5] Add sysctl for default hstate nodes_allowed.
  2009-06-17 17:52     ` Lee Schermerhorn
@ 2009-06-18  9:19       ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-06-18  9:19 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, Jun 17, 2009 at 01:52:02PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 14:41 +0100, Mel Gorman wrote:
> > On Tue, Jun 16, 2009 at 09:53:08AM -0400, Lee Schermerhorn wrote:
> > > [PATCH 4/5] add sysctl for default hstate nodes_allowed.
> > > 
> > > Against:  17may09 mmotm
> > > 
> > > This patch adds a sysctl -- /proc/sys/vm/hugepages_nodes_allowed --
> > > to set/query the default hstate's nodes_allowed.  I don't know
> > > that this patch is required, given that we have the per hstate
> > > controls in /sys/kernel/mm/hugepages/*. However, we've added sysctls
> > > for other recent hugepages controls, like nr_overcommit_hugepages,
> > > so I've followed that convention.
> > > 
> > 
> > Yeah, it's somewhat expected that what is in /proc/sys/vm is information
> > on the default hugepage size.
> > 
> > > Factor the formatting of the nodes_allowed mask out of nodes_allowed_show()
> > > for use by both that function and the hugetlb_nodes_allowed_handler().
> > > 
> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp>
> > > 
> > >  include/linux/hugetlb.h |    1 +
> > >  kernel/sysctl.c         |    8 ++++++++
> > >  mm/hugetlb.c            |   43 ++++++++++++++++++++++++++++++++++++++-----
> > >  3 files changed, 47 insertions(+), 5 deletions(-)
> > > 
> > > Index: linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h
> > > ===================================================================
> > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/include/linux/hugetlb.h	2009-06-04 12:59:32.000000000 -0400
> > > +++ linux-2.6.30-rc8-mmotm-090603-1633/include/linux/hugetlb.h	2009-06-04 12:59:35.000000000 -0400
> > > @@ -22,6 +22,7 @@ void reset_vma_resv_huge_pages(struct vm
> > >  int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> > >  int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> > >  int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> > > +int hugetlb_nodes_allowed_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
> > >  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> > >  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
> > >  void unmap_hugepage_range(struct vm_area_struct *,
> > > Index: linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c
> > > ===================================================================
> > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/kernel/sysctl.c	2009-06-04 12:59:26.000000000 -0400
> > > +++ linux-2.6.30-rc8-mmotm-090603-1633/kernel/sysctl.c	2009-06-04 12:59:35.000000000 -0400
> > > @@ -1108,6 +1108,14 @@ static struct ctl_table vm_table[] = {
> > >  		.extra1		= (void *)&hugetlb_zero,
> > >  		.extra2		= (void *)&hugetlb_infinity,
> > >  	},
> > > +	{
> > > +		.ctl_name	= CTL_UNNUMBERED,
> > > +		.procname	= "hugepages_nodes_allowed",
> > > +		.data		= NULL,
> > > +		.maxlen		= sizeof(unsigned long),
> > > +		.mode		= 0644,
> > > +		.proc_handler	= &hugetlb_nodes_allowed_handler,
> > > +	},
> > >  #endif
> > >  	{
> > >  		.ctl_name	= VM_LOWMEM_RESERVE_RATIO,
> > > Index: linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.30-rc8-mmotm-090603-1633.orig/mm/hugetlb.c	2009-06-04 12:59:33.000000000 -0400
> > > +++ linux-2.6.30-rc8-mmotm-090603-1633/mm/hugetlb.c	2009-06-04 12:59:35.000000000 -0400
> > > @@ -1354,19 +1354,27 @@ static ssize_t nr_overcommit_hugepages_s
> > >  }
> > >  HSTATE_ATTR(nr_overcommit_hugepages);
> > >  
> > > -static ssize_t nodes_allowed_show(struct kobject *kobj,
> > > -					struct kobj_attribute *attr, char *buf)
> > > +static int format_hstate_nodes_allowed(struct hstate *h, char *buf,
> > > +					size_t buflen)
> > >  {
> > > -	struct hstate *h = kobj_to_hstate(kobj);
> > >  	int len = 3;
> > >  
> > >  	if (h->nodes_allowed == &node_online_map)
> > >  		strcpy(buf, "all");
> > >  	else
> > > -		len = nodelist_scnprintf(buf, PAGE_SIZE,
> > > +		len = nodelist_scnprintf(buf, buflen,
> > >  					*h->nodes_allowed);
> > > +	return len;
> > > +
> > > +}
> > 
> > This looks like unnecessary churn and could have been done in the earlier
> > patch introducing nodes_allowed_show()
> 
> Yes.  I couldn't see creating the separate function if we weren't going
> to have the sysctl also formatting the nodes_allowed.  If all agree that
> we want the sysctl, I can fold this patch in with patch 2/5.
> 

I think we want the sysctl to be consistent with nr_hugepages and the
other hugepages parameters.

> > 
> > > +
> > > +static ssize_t nodes_allowed_show(struct kobject *kobj,
> > > +					struct kobj_attribute *attr, char *buf)
> > > +{
> > > +	struct hstate *h = kobj_to_hstate(kobj);
> > > +	int len =  format_hstate_nodes_allowed(h, buf, PAGE_SIZE);
> > >  
> > > -	if (len)
> > > +	if (len && (len +1) < PAGE_SIZE)
> > >  		buf[len++] = '\n';
> > >  
> > >  	return len;
> > > @@ -1684,6 +1692,31 @@ int hugetlb_overcommit_handler(struct ct
> > >  	return 0;
> > >  }
> > >  
> > > +#define NODES_ALLOWED_MAX 64
> > > +int hugetlb_nodes_allowed_handler(struct ctl_table *table, int write,
> > > +			struct file *file, void __user *buffer,
> > > +			size_t *length, loff_t *ppos)
> > > +{
> > > +	struct hstate *h = &default_hstate;
> > > +	int ret = 0;
> > > +
> > > +	if (write) {
> > > +		(void)set_hstate_nodes_allowed(h, buffer, 1);
> > > +	} else {
> > > +		char buf[NODES_ALLOWED_MAX];
> > > +		struct ctl_table tbl = {
> > > +			.data = buf,
> > > +			.maxlen = NODES_ALLOWED_MAX,
> > > +		};
> > > +		int len =  format_hstate_nodes_allowed(h, buf, sizeof(buf));
> > > +
> > > +		if (len)
> > > +			ret = proc_dostring(&tbl, write, file, buffer,
> > > +						 length, ppos);
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > >  #endif /* CONFIG_SYSCTL */
> > >  
> > >  void hugetlb_report_meminfo(struct seq_file *m)
> > > 
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-17 17:15   ` Lee Schermerhorn
@ 2009-06-18  9:33     ` Mel Gorman
  2009-06-18 14:46       ` Lee Schermerhorn
  2009-06-18 19:08     ` David Rientjes
  1 sibling, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-06-18  9:33 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Wed, Jun 17, 2009 at 01:15:54PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote:
> > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote:
> > > Because of assymmetries in some NUMA platforms, and "interesting"
> > > topologies emerging in the "scale up x86" world, we have need for
> > > better control over the placement of "fresh huge pages".  A while
> > > back Nish Aravamundan floated a series of patches to add per node
> > > controls for allocating pages to the hugepage pool and removing
> > > them.  Nish apparently moved on to other tasks before those patches
> > > were accepted.  I have kept a copy of Nish's patches and have
> > > intended to rebase and test them and resubmit.
> > > 
> > > In an [off-list] exchange with Mel Gorman, who admits to knowledge
> > > in the huge pages area, I asked his opinion of per node controls
> > > for huge pages and he suggested another approach:  using the mempolicy
> > > of the task that changes nr_hugepages to constrain the fresh huge
> > > page allocations.  I considered this approach but it seemed to me
> > > to be a misuse of mempolicy for populating the huge pages free
> > > pool. 
> > 
> > Why would it be a misuse? Fundamentally, the huge page pools are being
> > filled by the current process when nr_hugepages is being used. Or are
> > you concerned about the specification of hugepages on the kernel command
> > line?
> 
> Well, "misuse" might be too strong [and I already softened it from
> "abuse" :)]--more like "not a good fit, IMO".   I agree that it could be
> made to work, but it just didn't "feel" right.  I suppose that we could
> use alloc_pages_current() with the necessary gfp flags to specify the
> "thisnode"/"exact_node" semantics [no fallback], making sure we didn't
> OOM kill the task if a node couldn't satisfy the huge page
> allocation, ...  I think it would require major restructuring of
> set_max_huge_pages(), ...  Maybe that would be the right thing to do and
> maybe we'll end up there, but I was looking for a simpler approach.  
> 

Ultimately I think it's the right way of doing things. hugepages in the
past had a large number of "special" considerations in comparison to the
core VM. We've removed a fair few of the special cases in the last year
and I'd rather avoid introducing more of them.

If I were an administrator, I would prefer being able to allocate
hugepages on only two nodes with

# numactl --interleave 0,2 hugeadm --pool-pages-min=2M:128

instead of

# echo some_magic > /proc/sys/vm/hugepage_nodes_allowed
# hugeadm --pool-pages-min=2M:128

hugeadm would have to be taught about the new sysctl, of course, if it's
created, but it still feels more awkward than it should be, and numactl
should already be familiar.

> And, yes, the kernel command line was a consideration.  Because of that,
> I considered specifying a boot time "nodes_allowed" mask and
> constructing a "nodes_allowed" mask at run time from the current task's
> policy.  But, it seemed more straightforward to my twisted way of
> thinking to make the nodes_allowed mask a bona fide hstate attribute.
> 

I see scope for allowing the nodemask to be specified at boot-time all right,
heck it might even be required! However, I am somewhat attached to the idea
of static hugepage allocation obeying memory policies.

> > 
> > > Interleave policy doesn't have same "this node" semantics
> > > that we want
> > 
> > By "this node" semantics, do you mean allocating from one specific node?
> 
> Yes.
> 
> > In that case, why would specifying a nodemask of just one node not be
> > sufficient?
> 
> Maybe along with GFP_THISNODE. 

But a nodemask of just one node might as well be GFP_THISNODE, right?

> In general, we want to interleave the
> persistent huge pages over multiple nodes, but we don't want the
> allocations to succeed by falling back to another node.  Then we end up
> with unbalanced allocations.  This is how fresh huge page allocation
> worked before it was changed to use "THISNODE".
> 
> 
> > 
> > > and bind policy would require constructing a custom
> > > node mask for node as well as addressing OOM, which we don't want
> > > during fresh huge page allocation. 
> > 
> > Would the required mask not already be setup when the process set the
> > policy? OOM is not a major concern, it doesn't trigger for failed
> > hugepage allocations.
> 
> Perhaps not with the current implementation.  However, a certain,
> currently shipping, enterprise distro added a "quick and dirty"  [well,
> "dirty", anyway :)] implementation of "thisnode" behavior to elimiate
> the unwanted fallback behavior and resulting huge pages imbalance by
> constructing an ad hoc "MPOL_BIND" policy for allocating huge pages.  We
> found that this DOES result in oom kill of the task increasing
> nr_hugepages if it encountered a node w/ no available huge pages.
> 

Fun.

> If one used "echo" [usually a shell builtin] to increase nr_hugepages,
> it killed off your shell.  sysctl was somewhat better--only killed
> sysctl.  The good news was that oom kill just sets a thread info flag,
> so the nr_hugepages handler continued to allocate pages around the nodes
> and further oom kills were effective no-ops.  When the allocations was
> done, the victim task would die.  End users might not consider this
> acceptable behavior.
> 

No. My current expectation is that we've handled the vast majority of
cases where processes could get unexpectedly killed just because they
were looking funny at hugepages.

> I wasn't sure we'd avoid this situation if we dropped back to using task
> mempolicy via, e.g., alloc_pages_current().  So, I thought I'd run this
> proposal [nodes_allowed] by you.
> 

I'm not wholesale against it. Minimally, I see scope for specifying the
nodemask at boot-time but I'm less sold on the sysctl at the moment because
I'm not convinced that applying memory policies when sizing the hugepage pool
cannot be made to work to solve your problem.

> > 
> > > One could derive a node mask
> > > of allowed nodes for huge pages from the mempolicy of the task
> > > that is modifying nr_hugepages and use that for fresh huge pages
> > > with GFP_THISNODE.  However, if we're not going to use mempolicy
> > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with
> > > yet another on-stack pseudo-vma :(]--I thought it cleaner to
> > > define a "nodes allowed" nodemask for populating the [persistent]
> > > huge pages free pool.
> > > 
> > 
> > How about adding alloc_page_mempolicy() that takes the explicit mempolicy
> > you need?
> 
> Interesting you should mention this.  Somewhat off topic:  I have a
> "cleanup" and reorg of shared policy and vma policy that separates
> policy lookup from allocation, and adds an "alloc_page_pol()" function.
> This is part of my series to generalize shared policy and extend it to
> shared, mmap()ed regular files.  That aspect of the series [shared
> policy on shared mmaped files] got a lot of push back without any
> consideration of the technical details of the patches themselves.

Those arguments feel vaguely familiar. I think I read the thread, although
the specifics of the objections escape me. I think there was something about
non-determinism if two processes shared a mapping with different policies
and no idea which should take precedence.

> Besides having need/requests for this capability, the resulting cleanup,
> removal of all on-stack pseudo-vmas, ..., the series actually seems to
> perform [slightly] better on the testing I've done.
> 
> I keep this series up to date and hope to repost again sometime with
> benchmark results. 
> 

Sounds like a good idea. If the cleanup yielded an alloc_page_pol()
function, it might make the patch for hugetlbfs more straight-forward.

> 
> > 
> > > This patch series introduces a [per hugepage size] "sysctl",
> > > hugepages_nodes_allowed, that specifies a nodemask to constrain
> > > the allocation of persistent, fresh huge pages.   The nodemask
> > > may be specified by a sysctl, a sysfs huge pages attribute and
> > > on the kernel boot command line.  
> > > 
> > > The series includes a patch to free hugepages from the pool in a
> > > "round robin" fashion, interleaved across all on-line nodes to
> > > balance the hugepage pool across nodes.  Nish had a patch to do
> > > this, too.
> > > 
> > > Together, these changes don't provide the fine grain of control
> > > that per node attributes would. 
> > 
> > I'm failing to understand at the moment why mem policies set by numactl
> > would not do the job for allocation at least. Freeing is a different problem.
> 
> They could be made to work.  I actually started coding up a patch to
> extract a "nodes allowed" mask from the policy for use with
> alloc_fresh_huge_page[_node]() so that I could maintain the overall
> structure and use alloc_page_exact_node() with nodes in the allowed
> mask.  And, as you say, with some investigation, we may find that we can
> use alloc_pages_current() with appropriate flags to achieve the exact
> node semantics on each node in the policy.
> 

I'd like to see it investigated more, please.

> > 
> > > Specifically, there is no easy
> > > way to reduce the persistent huge page count for a specific node.
> > > I think the degree of control provided by these patches is the
> > > minimal necessary and sufficient for managing the persistent the
> > > huge page pool.  However, with a bit more reorganization,  we
> > > could implement per node controls if others would find that
> > > useful.
> > > 
> > > For more info, see the patch descriptions and the updated kernel
> > > hugepages documentation.
> > > 
> > 
> 
> More in response to your comments on the individual patches.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-18  9:33     ` Mel Gorman
@ 2009-06-18 14:46       ` Lee Schermerhorn
  2009-06-18 15:00         ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-18 14:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Thu, 2009-06-18 at 10:33 +0100, Mel Gorman wrote:
> On Wed, Jun 17, 2009 at 01:15:54PM -0400, Lee Schermerhorn wrote:
> > On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote:
> > > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote:
> > > > Because of assymmetries in some NUMA platforms, and "interesting"
> > > > topologies emerging in the "scale up x86" world, we have need for
> > > > better control over the placement of "fresh huge pages".  A while
> > > > back Nish Aravamundan floated a series of patches to add per node
> > > > controls for allocating pages to the hugepage pool and removing
> > > > them.  Nish apparently moved on to other tasks before those patches
> > > > were accepted.  I have kept a copy of Nish's patches and have
> > > > intended to rebase and test them and resubmit.
> > > > 
> > > > In an [off-list] exchange with Mel Gorman, who admits to knowledge
> > > > in the huge pages area, I asked his opinion of per node controls
> > > > for huge pages and he suggested another approach:  using the mempolicy
> > > > of the task that changes nr_hugepages to constrain the fresh huge
> > > > page allocations.  I considered this approach but it seemed to me
> > > > to be a misuse of mempolicy for populating the huge pages free
> > > > pool. 
> > > 
> > > Why would it be a misuse? Fundamentally, the huge page pools are being
> > > filled by the current process when nr_hugepages is being used. Or are
> > > you concerned about the specification of hugepages on the kernel command
> > > line?
> > 
> > Well, "misuse" might be too strong [and I already softened it from
> > "abuse" :)]--more like "not a good fit, IMO".   I agree that it could be
> > made to work, but it just didn't "feel" right.  I suppose that we could
> > use alloc_pages_current() with the necessary gfp flags to specify the
> > "thisnode"/"exact_node" semantics [no fallback], making sure we didn't
> > OOM kill the task if a node couldn't satisfy the huge page
> > allocation, ...  I think it would require major restructuring of
> > set_max_huge_pages(), ...  Maybe that would be the right thing to do and
> > maybe we'll end up there, but I was looking for a simpler approach.  
> > 
> 
> Ultimately I think it's the right way of doing things. hugepages in the
> past had a large number of "special" considerations in comparison to the
> core VM. We've removed a fair few of the special cases in the last year
> and I'd rather avoid introducing more of them.
> 
> If I was as an administrator, I would prefer being able to do allocate
> hugepages only on two nodes with
> 
> # numactl --interleave 0,2 hugeadm --pool-pages-min=2M:128

Of course, this means that hugeadm itself and any library pages it pulls
in will be interleaved across those nodes.  I don't know that that's an
issue.  hugeadm or any command used to resize the pool will need to run
somewhere.

> 
> instead of
> 
> # echo some_magic > /proc/sys/vm/hugepage_nodes_allowed
> # hugeadm --pool-pages-min=2M:128
> 
> hugeadm would have to be taught about the new sysctl of course if it's created
> but it still feels more awkward than it should be and numactl should already
> be familar.
> 
> > And, yes, the kernel command line was a consideration.  Because of that,
> > I considered specifying a boot time "nodes_allowed" mask and
> > constructing a "nodes_allowed" mask at run time from the current task's
> > policy.  But, it seemed more straightforward to my twisted way of
> > thinking to make the nodes_allowed mask a bona fide hstate attribute.
> > 
> 
> I see scope for allowing the nodemask to be specified at boot-time all right,
> heck it might even be required! 

Yeah, the hugepages doc does recommend that one allocate the pool at
boot time or in an init script when the allocations still have a good
chance of succeeding.

> However, I am somewhat attached to the idea
> of static hugepage allocation obeying memory policies.

Really?  I couldn't tell ;).

> 
> > > 
> > > > Interleave policy doesn't have same "this node" semantics
> > > > that we want
> > > 
> > > By "this node" semantics, do you mean allocating from one specific node?
> > 
> > Yes.
> > 
> > > In that case, why would specifying a nodemask of just one node not be
> > > sufficient?
> > 
> > Maybe along with GFP_THISNODE. 
> 
> But a nodemask of just one node might as well be GFP_THISNODE, right?

No.  w/o GFP_THISNODE, the allocation will use the generic zonelist and
will fall back to another node if the target node doesn't have any
available huge pages.  Looks just like any other higher order page
allocation.  And w/o 'THISNODE, I think we'll OOM kill the task
increasing max hugepages if the allocation fails--need to test this.  We
may want/accept this behavior for satisfying tasks' page faults, but not
for allocating static huge pages into the pool--IMO, anyway.

So, my approach to using task mempolicy would be to construct a
"nodes_allowed" mask from the policy and then use the same
alloc_pages_exact_node() with the same flags we currently use.  This is
what I mean about it being a "misfit".  What I really want is a node
mask over which I'll distribute the huge pages.  The policy mode [bind,
pref, interleave, local] is irrelevant other than in constructing this
mask.  That's what led me to the nodes_allowed attribute/sysctl.
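
Roughly, the derivation I have in mind would look something like the
following.  This is just an untested sketch -- the helper doesn't exist
and the name is made up; the MPOL_PREFERRED/local-node details are
glossed over:

/*
 * Derive a nodes_allowed mask from the mempolicy of the task that is
 * writing nr_hugepages.  Only the nodes matter; the policy mode is
 * irrelevant beyond picking them out.
 */
static void nodes_allowed_from_task_policy(nodemask_t *nodes_allowed)
{
        struct mempolicy *pol = current->mempolicy;

        if (!pol || pol->mode == MPOL_DEFAULT) {
                *nodes_allowed = node_online_map;
                return;
        }

        if (pol->mode == MPOL_PREFERRED)
                /* n.b. "local" (preferred_node == -1) not handled here */
                *nodes_allowed = nodemask_of_node(pol->v.preferred_node);
        else            /* MPOL_BIND or MPOL_INTERLEAVE */
                *nodes_allowed = pol->v.nodes;
}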

> 
> > In general, we want to interleave the
> > persistent huge pages over multiple nodes, but we don't want the
> > allocations to succeed by falling back to another node.  Then we end up
> > with unbalanced allocations.  This is how fresh huge page allocation
> > worked before it was changed to use "THISNODE".
> > 
> > 
> > > 
> > > > and bind policy would require constructing a custom
> > > > node mask for node as well as addressing OOM, which we don't want
> > > > during fresh huge page allocation. 
> > > 
> > > Would the required mask not already be setup when the process set the
> > > policy? OOM is not a major concern, it doesn't trigger for failed
> > > hugepage allocations.
> > 
> > Perhaps not with the current implementation.  However, a certain,
> > currently shipping, enterprise distro added a "quick and dirty"  [well,
> > "dirty", anyway :)] implementation of "thisnode" behavior to elimiate
> > the unwanted fallback behavior and resulting huge pages imbalance by
> > constructing an ad hoc "MPOL_BIND" policy for allocating huge pages.  We
> > found that this DOES result in oom kill of the task increasing
> > nr_hugepages if it encountered a node w/ no available huge pages.
> > 
> 
> Fun.
> 
> > If one used "echo" [usually a shell builtin] to increase nr_hugepages,
> > it killed off your shell.  sysctl was somewhat better--only killed
> > sysctl.  The good news was that oom kill just sets a thread info flag,
> > so the nr_hugepages handler continued to allocate pages around the nodes
> > and further oom kills were effective no-ops.  When the allocations was
> > done, the victim task would die.  End users might not consider this
> > acceptable behavior.
> > 
> 
> No. My current expectation is that we've handled the vast majority of
> cases where processes could get unexpectedly killed just because they
> were looking funny at hugepages.

Yes, doesn't seem to happen with current mainline kernel with or without
this series.

> 
> > I wasn't sure we'd avoid this situation if we dropped back to using task
> > mempolicy via, e.g., alloc_pages_current().  So, I thought I'd run this
> > proposal [nodes_allowed] by you.
> > 
> 
> I'm not wholesale against it. Minimally, I see scope for specifying the
> nodemask at boot-time but I'm less sold on the sysctl at the moment because
> I'm not convinced that applying memory policies when sizing the hugepage pool
> cannot be made work to solve your problem.

Oh, I'm sure they can be made to work.  Even for the boot command line,
one could specify a policy and we have the mechanism to parse that.
However, it still feels to me like the wrong tool for the job.
Populating the pool is, IMO, quite different from handling page faults.

I do worry about the proverbial "junior admin", charged with increasing
the huge page pool, forgetting to use numactl and the correct policy as
determined by, say, the system architect.  I suppose one could provide
wrappers around hugeadm or whatever to always use some predefined policy.


I think we have a couple of models at play here:

1) a more static, architected system layout, where I want to place the
static huge pages on a specific set of nodes [perhaps at boot time]
where they will be used for some long-running "enterprise" application[s]
to which the system is more or less dedicated.  Alternatively, one may
just want to avoid [ever] having static huge pages allocated on some set
of nodes.

2) a more dynamic model using libhugetlbfs [with which I have ZERO
experience :(] in which we are using overcommitted/surplus huge pages
in addition to a fairly uniformly distributed pool of static huge pages,
counting on defragmentation and lumpy reclaim to satisfy requests for
huge pages in excess of the static pool.

Perhaps you see things differently.  However, if you agree, I'm thinking
we might want to describe these models in the hugepage kernel doc.  Even
if you don't agree, I think it would be a good idea to mention the use
cases for huge pages in that doc.

> 
> > > 
> > > > One could derive a node mask
> > > > of allowed nodes for huge pages from the mempolicy of the task
> > > > that is modifying nr_hugepages and use that for fresh huge pages
> > > > with GFP_THISNODE.  However, if we're not going to use mempolicy
> > > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with
> > > > yet another on-stack pseudo-vma :(]--I thought it cleaner to
> > > > define a "nodes allowed" nodemask for populating the [persistent]
> > > > huge pages free pool.
> > > > 
> > > 
> > > How about adding alloc_page_mempolicy() that takes the explicit mempolicy
> > > you need?
> > 
> > Interesting you should mention this.  Somewhat off topic:  I have a
> > "cleanup" and reorg of shared policy and vma policy that separates
> > policy lookup from allocation, and adds an "alloc_page_pol()" function.
> > This is part of my series to generalize shared policy and extend it to
> > shared, mmap()ed regular files.  That aspect of the series [shared
> > policy on shared mmaped files] got a lot of push back without any
> > consideration of the technical details of the patches themselves.
> 
> Those arguements feel vaguely familiar. I think I read the thread although
> the specifics of the objections escape me. Think there was something about
> non-determinism if two processes shared a mapping with different policies
> and no idea which should take precedence.

Actually, that last sentence describes the situation I'm trying to
avoid.  But, that's another conversation...

> > Besides having need/requests for this capability, the resulting cleanup,
> > removal of all on-stack pseudo-vmas, ..., the series actually seems to
> > perform [slightly] better on the testing I've done.
> > 
> > I keep this series up to date and hope to repost again sometime with
> > benchmark results. 
> > 
> 
> Sounds like a good idea. If the cleanup yielded an alloc_page_pol()
> function, it might make the patch for hugetlbfs more straight-forward.
> 
> > 
> > > 
> > > > This patch series introduces a [per hugepage size] "sysctl",
> > > > hugepages_nodes_allowed, that specifies a nodemask to constrain
> > > > the allocation of persistent, fresh huge pages.   The nodemask
> > > > may be specified by a sysctl, a sysfs huge pages attribute and
> > > > on the kernel boot command line.  
> > > > 
> > > > The series includes a patch to free hugepages from the pool in a
> > > > "round robin" fashion, interleaved across all on-line nodes to
> > > > balance the hugepage pool across nodes.  Nish had a patch to do
> > > > this, too.
> > > > 
> > > > Together, these changes don't provide the fine grain of control
> > > > that per node attributes would. 
> > > 
> > > I'm failing to understand at the moment why mem policies set by numactl
> > > would not do the job for allocation at least. Freeing is a different problem.
> > 
> > They could be made to work.  I actually started coding up a patch to
> > extract a "nodes allowed" mask from the policy for use with
> > alloc_fresh_huge_page[_node]() so that I could maintain the overall
> > structure and use alloc_page_exact_node() with nodes in the allowed
> > mask.  And, as you say, with some investigation, we may find that we can
> > use alloc_pages_current() with appropriate flags to achieve the exact
> > node semantics on each node in the policy.
> > 
> 
> I'd like to see it investigated more please.

OK.  That will take a bit longer. :)   I do think we'll need to
reorganize set_max_huge_pages to "drive" the nodes allowed [from the
policy] from the top loop, rather than bury it down in
alloc_fresh_huge_page().  Might even end up eliminating that function
and call the per node version directly.  If we drive from the top, we
can use the policy mask for freeing static huge pages, as well, altho'
we'd have to decide the sense of the mask:  does it specify the nodes
from which to free or the nodes where we want static huge pages to
remain.
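
Just to give a feel for it, the allocation side might end up looking
something like the sketch below.  Untested, not even compile-tested, and
it glosses over what to do when one node fails while others could still
satisfy the request:

        /*
         * Drive the round-robin over nodes_allowed from
         * set_max_huge_pages() itself, calling the per node allocator
         * directly rather than going through alloc_fresh_huge_page().
         */
        while (count > persistent_huge_pages(h)) {
                struct page *page;
                int nid = hstate_next_node(h);  /* next nid in nodes_allowed */

                spin_unlock(&hugetlb_lock);
                page = alloc_fresh_huge_page_node(h, nid);
                spin_lock(&hugetlb_lock);
                if (!page)
                        break;
        }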

I had planned anyway to go back and look at this, even with the
nodes_allowed attribute/sysctl so that, when decreasing nr_hugepages, we
free unused huge pages and demote to surplus in-use pages on the
non-allowed nodes before those on the allowed nodes.

So, I think I'll clean up this series, based on your feedback, so we'll
have something to compare with the mempolicy version and then look into
the alternate implementation.  It'll be a while [probably] before I can
spend the time to do the latter.

Thanks for your efforts to review these.

Lee

> 
> > > 
> > > > Specifically, there is no easy
> > > > way to reduce the persistent huge page count for a specific node.
> > > > I think the degree of control provided by these patches is the
> > > > minimal necessary and sufficient for managing the persistent the
> > > > huge page pool.  However, with a bit more reorganization,  we
> > > > could implement per node controls if others would find that
> > > > useful.
> > > > 
> > > > For more info, see the patch descriptions and the updated kernel
> > > > hugepages documentation.
> > > > 
> > > 
> > 
> > More in response to your comments on the individual patches.
> > 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-18 14:46       ` Lee Schermerhorn
@ 2009-06-18 15:00         ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-06-18 15:00 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Nishanth Aravamudan, Adam Litke, Andy Whitcroft,
	eric.whitney

On Thu, Jun 18, 2009 at 10:46:34AM -0400, Lee Schermerhorn wrote:
> On Thu, 2009-06-18 at 10:33 +0100, Mel Gorman wrote:
> > On Wed, Jun 17, 2009 at 01:15:54PM -0400, Lee Schermerhorn wrote:
> > > On Wed, 2009-06-17 at 14:02 +0100, Mel Gorman wrote:
> > > > On Tue, Jun 16, 2009 at 09:52:28AM -0400, Lee Schermerhorn wrote:
> > > > > Because of assymmetries in some NUMA platforms, and "interesting"
> > > > > topologies emerging in the "scale up x86" world, we have need for
> > > > > better control over the placement of "fresh huge pages".  A while
> > > > > back Nish Aravamundan floated a series of patches to add per node
> > > > > controls for allocating pages to the hugepage pool and removing
> > > > > them.  Nish apparently moved on to other tasks before those patches
> > > > > were accepted.  I have kept a copy of Nish's patches and have
> > > > > intended to rebase and test them and resubmit.
> > > > > 
> > > > > In an [off-list] exchange with Mel Gorman, who admits to knowledge
> > > > > in the huge pages area, I asked his opinion of per node controls
> > > > > for huge pages and he suggested another approach:  using the mempolicy
> > > > > of the task that changes nr_hugepages to constrain the fresh huge
> > > > > page allocations.  I considered this approach but it seemed to me
> > > > > to be a misuse of mempolicy for populating the huge pages free
> > > > > pool. 
> > > > 
> > > > Why would it be a misuse? Fundamentally, the huge page pools are being
> > > > filled by the current process when nr_hugepages is being used. Or are
> > > > you concerned about the specification of hugepages on the kernel command
> > > > line?
> > > 
> > > Well, "misuse" might be too strong [and I already softened it from
> > > "abuse" :)]--more like "not a good fit, IMO".   I agree that it could be
> > > made to work, but it just didn't "feel" right.  I suppose that we could
> > > use alloc_pages_current() with the necessary gfp flags to specify the
> > > "thisnode"/"exact_node" semantics [no fallback], making sure we didn't
> > > OOM kill the task if a node couldn't satisfy the huge page
> > > allocation, ...  I think it would require major restructuring of
> > > set_max_huge_pages(), ...  Maybe that would be the right thing to do and
> > > maybe we'll end up there, but I was looking for a simpler approach.  
> > > 
> > 
> > Ultimately I think it's the right way of doing things. hugepages in the
> > past had a large number of "special" considerations in comparison to the
> > core VM. We've removed a fair few of the special cases in the last year
> > and I'd rather avoid introducing more of them.
> > 
> > If I was as an administrator, I would prefer being able to do allocate
> > hugepages only on two nodes with
> > 
> > # numactl --interleave 0,2 hugeadm --pool-pages-min=2M:128
> 
> Of course, this means that hugeadm itself and any library pages it pulls
> in will be interleaved across those nodes.  I don't know that that's an
> issue.  hugeadm or any command used to resize the pool will need to run
> somewhere.
> 

Pretty much. The amount of memory you are talking about for the hugeadm
binary is not worth worrying about. I guess there is a very small corner
case whereby a full node couldn't be allocated because the binary text
could not be loaded but that's pretty specific.

> > 
> > instead of
> > 
> > # echo some_magic > /proc/sys/vm/hugepage_nodes_allowed
> > # hugeadm --pool-pages-min=2M:128
> > 
> > hugeadm would have to be taught about the new sysctl of course if it's created
> > but it still feels more awkward than it should be and numactl should already
> > be familar.
> > 
> > > And, yes, the kernel command line was a consideration.  Because of that,
> > > I considered specifying a boot time "nodes_allowed" mask and
> > > constructing a "nodes_allowed" mask at run time from the current task's
> > > policy.  But, it seemed more straightforward to my twisted way of
> > > thinking to make the nodes_allowed mask a bona fide hstate attribute.
> > > 
> > 
> > I see scope for allowing the nodemask to be specified at boot-time all right,
> > heck it might even be required! 
> 
> Yeah, the hugepages doc does recommend that one allocate the pool at
> boot time or in an init script when the allocations still have a good
> chance of succeeding.
> 

Yep, while this has improved a lot, it still doesn't hurt.

> > However, I am somewhat attached to the idea
> > of static hugepage allocation obeying memory policies.
> 
> Really?  I couldn't tell ;).
> 

It must be my innate subtlety

> > 
> > > > 
> > > > > Interleave policy doesn't have same "this node" semantics
> > > > > that we want
> > > > 
> > > > By "this node" semantics, do you mean allocating from one specific node?
> > > 
> > > Yes.
> > > 
> > > > In that case, why would specifying a nodemask of just one node not be
> > > > sufficient?
> > > 
> > > Maybe along with GFP_THISNODE. 
> > 
> > But a nodemask of just one node might as well be GFP_THISNODE, right?
> 
> No.  w/o GFP_THISNODE, the allocation will use the generic zonelist and
> will fall back to another node if the target node doesn't have any
> available huge pages.  Looks just like any other higher order page
> allocation.  And w/o 'THISNODE, I think we'll OOM kill the task
> increasing max hugepages if the allocation fails--need to test this. 

Test this, because OOM should not be triggering for high order allocations
like this. It was too easy to trigger OOM before by just writing large
numbers to nr_hugepages.

> We
> may want/accept this behavior for satisfying tasks' page faults, but not
> for allocating static huge pages into the pool--IMO, anyway.
> 
> So, my approach to using task mempolicy would be to construct a
> "nodes_allowed" mask from the policy and then use the same
> alloc_pages_exact_node() with the same flags we currently use.  This is
> what I mean about it being a "misfit".  What I really want is a node
> mask over which I'll distribute the huge pages.  The policy mode [bind,
> pref, interleave, local] is irrelevant other than in constructing this
> mask.  That's what led me to the nodes_allowed attribute/sysctl.
> 

but it does sound like alloc_pages_pol(), which takes a policy and
nodemask, might also do the job.
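
Just to sketch the shape I mean -- the name and signature are invented
here, nothing like this exists today, and the policy handling is elided:

/* hypothetical interface, roughly mirroring __alloc_pages_nodemask() */
static inline struct page *alloc_pages_pol(gfp_t gfp, unsigned int order,
                        struct mempolicy *pol, nodemask_t *nodes, int nid)
{
        /*
         * pol would choose the interleave/preferred node, nodes would
         * constrain any fallback; only the plumbing is shown here.
         */
        return __alloc_pages_nodemask(gfp, order, node_zonelist(nid, gfp),
                                        nodes);
}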

> > 
> > > In general, we want to interleave the
> > > persistent huge pages over multiple nodes, but we don't want the
> > > allocations to succeed by falling back to another node.  Then we end up
> > > with unbalanced allocations.  This is how fresh huge page allocation
> > > worked before it was changed to use "THISNODE".
> > > 
> > > 
> > > > 
> > > > > and bind policy would require constructing a custom
> > > > > node mask for node as well as addressing OOM, which we don't want
> > > > > during fresh huge page allocation. 
> > > > 
> > > > Would the required mask not already be setup when the process set the
> > > > policy? OOM is not a major concern, it doesn't trigger for failed
> > > > hugepage allocations.
> > > 
> > > Perhaps not with the current implementation.  However, a certain,
> > > currently shipping, enterprise distro added a "quick and dirty"  [well,
> > > "dirty", anyway :)] implementation of "thisnode" behavior to elimiate
> > > the unwanted fallback behavior and resulting huge pages imbalance by
> > > constructing an ad hoc "MPOL_BIND" policy for allocating huge pages.  We
> > > found that this DOES result in oom kill of the task increasing
> > > nr_hugepages if it encountered a node w/ no available huge pages.
> > > 
> > 
> > Fun.
> > 
> > > If one used "echo" [usually a shell builtin] to increase nr_hugepages,
> > > it killed off your shell.  sysctl was somewhat better--only killed
> > > sysctl.  The good news was that oom kill just sets a thread info flag,
> > > so the nr_hugepages handler continued to allocate pages around the nodes
> > > and further oom kills were effective no-ops.  When the allocations was
> > > done, the victim task would die.  End users might not consider this
> > > acceptable behavior.
> > > 
> > 
> > No. My current expectation is that we've handled the vast majority of
> > cases where processes could get unexpectedly killed just because they
> > were looking funny at hugepages.
> 
> Yes, doesn't seem to happen with current mainline kernel with or without
> this series.
> 

Good, that's what I was expecting.

> > 
> > > I wasn't sure we'd avoid this situation if we dropped back to using task
> > > mempolicy via, e.g., alloc_pages_current().  So, I thought I'd run this
> > > proposal [nodes_allowed] by you.
> > > 
> > 
> > I'm not wholesale against it. Minimally, I see scope for specifying the
> > nodemask at boot-time but I'm less sold on the sysctl at the moment because
> > I'm not convinced that applying memory policies when sizing the hugepage pool
> > cannot be made work to solve your problem.
> 
> Oh, I'm sure they can be made to work.  Even for the boot command line,
> one could specify a policy and we have the mechanism to parse that.
> However, it still feels to me like the wrong tool for the job.
> Populating the pool is, IMO, quite different from handling page faults.
> 

Yeah, what I'm suggesting is a bit of a hack, but it's in line with how
NUMA is controlled in other situations.

> I do worry about the proverbial "junior admin", charged with increasing
> the huge page pool, forgetting to use numactl and the correct policy as
> determined by, say, the system architect.  I suppose one could provide
> wrappers around hugeadm or whatever to always use some predefined policy.
> 

Yeah. However, I would point out that if he's getting numactl wrong,
there is no guarantee that he would get the sysctl right.

> 
> I think we have a couple of models at play here:
> 
> 1) a more static, architected system layout, where I want to place the
> static huge pages on a specific set of nodes [perhaps at boot time]
> where the will be used for some long-running "enterprise" application[s]
> to which the system is more or less dedicated.  Alternatively, one may
> just want to avoid [ever] having static huge pages allocated on some set
> of nodes.
> 
> 2) a more dynamic model using libhugetlbfs [with which I have ZERO
> experience :(] in which we are using overcommitted/surplus huge pages
> in addition to a fairly uniformly distributed pool of static huge pages,
> counting on defragmentation and lumpy reclaim to satisfy requests for
> huge pages in excess of the static pool.
> 

The more dynamic model should use the normal NUMA policies and what
exists already.

> Perhaps you see things differently.  However, if you agree, I'm thinking
> we might want to describe these models in the hugepage kernel doc.  Even
> if you don't agree, I think it would be a good idea to mention the use
> cases for huge pages in that doc.
> 

It's been on my TODO list to describe this for I don't know how long :(
. I'll take this as another nudge that I really should get around to it.

> > 
> > > > 
> > > > > One could derive a node mask
> > > > > of allowed nodes for huge pages from the mempolicy of the task
> > > > > that is modifying nr_hugepages and use that for fresh huge pages
> > > > > with GFP_THISNODE.  However, if we're not going to use mempolicy
> > > > > directly--e.g., via alloc_page_current() or alloc_page_vma() [with
> > > > > yet another on-stack pseudo-vma :(]--I thought it cleaner to
> > > > > define a "nodes allowed" nodemask for populating the [persistent]
> > > > > huge pages free pool.
> > > > > 
> > > > 
> > > > How about adding alloc_page_mempolicy() that takes the explicit mempolicy
> > > > you need?
> > > 
> > > Interesting you should mention this.  Somewhat off topic:  I have a
> > > "cleanup" and reorg of shared policy and vma policy that separates
> > > policy lookup from allocation, and adds an "alloc_page_pol()" function.
> > > This is part of my series to generalize shared policy and extend it to
> > > shared, mmap()ed regular files.  That aspect of the series [shared
> > > policy on shared mmaped files] got a lot of push back without any
> > > consideration of the technical details of the patches themselves.
> > 
> > Those arguements feel vaguely familiar. I think I read the thread although
> > the specifics of the objections escape me. Think there was something about
> > non-determinism if two processes shared a mapping with different policies
> > and no idea which should take precedence.
> 
> Actually, that last sentence describes the situation I'm trying to
> avoid.  But, that's another conversation...
> 

With shared policies for regular files, sure, but I'm not sure it
directly applies to the problem we are trying to solve here - being more
specific about which nodes are used to fill the static hugepage pool.

> > > Besides having need/requests for this capability, the resulting cleanup,
> > > removal of all on-stack pseudo-vmas, ..., the series actually seems to
> > > perform [slightly] better on the testing I've done.
> > > 
> > > I keep this series up to date and hope to repost again sometime with
> > > benchmark results. 
> > > 
> > 
> > Sounds like a good idea. If the cleanup yielded an alloc_page_pol()
> > function, it might make the patch for hugetlbfs more straight-forward.
> > 
> > > 
> > > > 
> > > > > This patch series introduces a [per hugepage size] "sysctl",
> > > > > hugepages_nodes_allowed, that specifies a nodemask to constrain
> > > > > the allocation of persistent, fresh huge pages.   The nodemask
> > > > > may be specified by a sysctl, a sysfs huge pages attribute and
> > > > > on the kernel boot command line.  
> > > > > 
> > > > > The series includes a patch to free hugepages from the pool in a
> > > > > "round robin" fashion, interleaved across all on-line nodes to
> > > > > balance the hugepage pool across nodes.  Nish had a patch to do
> > > > > this, too.
> > > > > 
> > > > > Together, these changes don't provide the fine grain of control
> > > > > that per node attributes would. 
> > > > 
> > > > I'm failing to understand at the moment why mem policies set by numactl
> > > > would not do the job for allocation at least. Freeing is a different problem.
> > > 
> > > They could be made to work.  I actually started coding up a patch to
> > > extract a "nodes allowed" mask from the policy for use with
> > > alloc_fresh_huge_page[_node]() so that I could maintain the overall
> > > structure and use alloc_page_exact_node() with nodes in the allowed
> > > mask.  And, as you say, with some investigation, we may find that we can
> > > use alloc_pages_current() with appropriate flags to achieve the exact
> > > node semantics on each node in the policy.
> > > 
> > 
> > I'd like to see it investigated more please.
> 
> OK.  That will take a bit longer. :)   I do think we'll need to
> reorganize set_max_huge_pages to "drive" the nodes allowed [from the
> policy] from the top loop, rather than bury it down in
> alloc_fresh_huge_page().  Might even end up eliminating that function
> and call the per node version directly. 

Sounds reasonable.

> If we drive from the top, we
> can use the policy mask for freeing static huge pages, as well, altho'
> we'd have to decide the sense of the mask:  does it specify the nodes
> from which to free or the nodes where we want static huge pages to
> remain.
> 

I'd interpret it to mean the nodes from which we are freeing - i.e., the
nodes_allowed specifies the nodes we are operating on, whether we are
allocating or freeing.

> I had planned anyway to go back and look at this, even with the
> nodes_allowed attribute/sysctl so that, when decreasing nr_hugepages, we
> free unused huge pages and demote to surplus in-use pages on the
> non-allowed nodes before those on the allowed nodes.
> 
> So, I think I'll clean up this series, based on your feedback, so we'll
> have something to compare with the mempolicy version and then look into
> the alternate implementation.  It'll be a while [probably] before I can
> spend the time to do the latter.
> 

Sounds good.

> Thanks for your efforts to review these.
> 

Thanks for investigating this.

> Lee
> 
> > 
> > > > 
> > > > > Specifically, there is no easy
> > > > > way to reduce the persistent huge page count for a specific node.
> > > > > I think the degree of control provided by these patches is the
> > > > > minimal necessary and sufficient for managing the persistent the
> > > > > huge page pool.  However, with a bit more reorganization,  we
> > > > > could implement per node controls if others would find that
> > > > > useful.
> > > > > 
> > > > > For more info, see the patch descriptions and the updated kernel
> > > > > hugepages documentation.
> > > > > 
> > > > 
> > > 
> > > More in response to your comments on the individual patches.
> > > 
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/5] Update huge pages kernel documentation
  2009-06-16 13:53 ` [PATCH 5/5] Update huge pages kernel documentation Lee Schermerhorn
@ 2009-06-18 18:49   ` David Rientjes
  2009-06-18 19:06     ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: David Rientjes @ 2009-06-18 18:49 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

On Tue, Jun 16, 2009 at 6:53 AM, Lee
Schermerhorn<lee.schermerhorn@hp.com> wrote:
> @@ -67,26 +65,76 @@ use either the mmap system call or share
>  the huge pages.  It is required that the system administrator preallocate
>  enough memory for huge page purposes.
>
> -Use the following command to dynamically allocate/deallocate hugepages:
> +The administrator can preallocate huge pages on the kernel boot command line by
> +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> +requested.  This is the most reliable method for preallocating huge pages as
> +memory has not yet become fragmented.
> +
> +Some platforms support multiple huge page sizes.  To preallocate huge pages
> +of a specific size, one must preceed the huge pages boot command parameters
> +with a huge page size selection parameter "hugepagesz=<size>".  <size> must
> +be specified in bytes with optional scale suffix [kKmMgG].  The default huge
> +page size may be selected with the "default_hugepagesz=<size>" boot parameter.
> +
> +/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> +size] hugetlb pages in the kernel.  Super user can dynamically request more
> +(or free some pre-configured) hugepages.
> +
> +Use the following command to dynamically allocate/deallocate default sized
> +hugepages:
>
>        echo 20 > /proc/sys/vm/nr_hugepages
>
> -This command will try to configure 20 hugepages in the system.  The success
> -or failure of allocation depends on the amount of physically contiguous
> -memory that is preset in system at this time.  System administrators may want
> -to put this command in one of the local rc init files.  This will enable the
> -kernel to request huge pages early in the boot process (when the possibility
> -of getting physical contiguous pages is still very high). In either
> -case, administrators will want to verify the number of hugepages actually
> -allocated by checking the sysctl or meminfo.
> +This command will try to configure 20 default sized hugepages in the system.
> +On a NUMA platform, the kernel will attempt to distribute the hugepage pool
> +over the nodes specified by the /proc/sys/vm/hugepages_nodes_allowed node mask.
> +hugepages_nodes_allowed defaults to all on-line nodes.
> +
> +To control the nodes on which huge pages are preallocated, the administrator
> +may set the hugepages_nodes_allowed for the default huge page size using:
> +
> +       echo <nodelist> >/proc/sys/vm/hugepages_nodes_allowed
> +

This probably also needs an update to
Documentation/ABI/testing/sysfs-kernel-mm-hugepages for the
non-default hstate nodes_allowed.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/5] Update huge pages kernel documentation
  2009-06-18 18:49   ` David Rientjes
@ 2009-06-18 19:06     ` Lee Schermerhorn
  0 siblings, 0 replies; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-18 19:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, akpm, Mel Gorman, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

On Thu, 2009-06-18 at 11:49 -0700, David Rientjes wrote:
> On Tue, Jun 16, 2009 at 6:53 AM, Lee
> Schermerhorn<lee.schermerhorn@hp.com> wrote:
> > @@ -67,26 +65,76 @@ use either the mmap system call or share
> >  the huge pages.  It is required that the system administrator preallocate
> >  enough memory for huge page purposes.
> >
> > -Use the following command to dynamically allocate/deallocate hugepages:
> > +The administrator can preallocate huge pages on the kernel boot command line by
> > +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> > +requested.  This is the most reliable method for preallocating huge pages as
> > +memory has not yet become fragmented.
> > +
> > +Some platforms support multiple huge page sizes.  To preallocate huge pages
> > +of a specific size, one must preceed the huge pages boot command parameters
> > +with a huge page size selection parameter "hugepagesz=<size>".  <size> must
> > +be specified in bytes with optional scale suffix [kKmMgG].  The default huge
> > +page size may be selected with the "default_hugepagesz=<size>" boot parameter.
> > +
> > +/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> > +size] hugetlb pages in the kernel.  Super user can dynamically request more
> > +(or free some pre-configured) hugepages.
> > +
> > +Use the following command to dynamically allocate/deallocate default sized
> > +hugepages:
> >
> >        echo 20 > /proc/sys/vm/nr_hugepages
> >
> > -This command will try to configure 20 hugepages in the system.  The success
> > -or failure of allocation depends on the amount of physically contiguous
> > -memory that is preset in system at this time.  System administrators may want
> > -to put this command in one of the local rc init files.  This will enable the
> > -kernel to request huge pages early in the boot process (when the possibility
> > -of getting physical contiguous pages is still very high). In either
> > -case, administrators will want to verify the number of hugepages actually
> > -allocated by checking the sysctl or meminfo.
> > +This command will try to configure 20 default sized hugepages in the system.
> > +On a NUMA platform, the kernel will attempt to distribute the hugepage pool
> > +over the nodes specified by the /proc/sys/vm/hugepages_nodes_allowed node mask.
> > +hugepages_nodes_allowed defaults to all on-line nodes.
> > +
> > +To control the nodes on which huge pages are preallocated, the administrator
> > +may set the hugepages_nodes_allowed for the default huge page size using:
> > +
> > +       echo <nodelist> >/proc/sys/vm/hugepages_nodes_allowed
> > +
> 
> This probably also needs an update to
> Documentation/ABI/testing/sysfs-kernel-mm-hugepages for the
> non-default hstate nodes_allowed.


Thanks, David.  I'll take a look and address that in the next respin of
the series.  If you've been following the exchange with Mel, you'll know
that the approach may change quite a bit.  However it ends up, I'll
update the abi testing doc or yell for help.

Lee


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-17 17:15   ` Lee Schermerhorn
  2009-06-18  9:33     ` Mel Gorman
@ 2009-06-18 19:08     ` David Rientjes
  2009-06-24  7:11       ` David Rientjes
  1 sibling, 1 reply; 31+ messages in thread
From: David Rientjes @ 2009-06-18 19:08 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, akpm, Nishanth Aravamudan, Adam Litke,
	Andy Whitcroft, eric.whitney

On Wed, 17 Jun 2009, Lee Schermerhorn wrote:

> > > Specifically, there is no easy
> > > way to reduce the persistent huge page count for a specific node.
> > > I think the degree of control provided by these patches is the
> > > minimal necessary and sufficient for managing the persistent the
> > > huge page pool.  However, with a bit more reorganization,  we
> > > could implement per node controls if others would find that
> > > useful.
> > > 

Thanks for looking at this, it's going to be very useful.

From a cpusets perspective, control over the hugepage allocations on 
various nodes is essential when the number of nodes on the system is not 
trivially small.  While cpusets are generally used to assign applications 
to a set of nodes to which they have affinity, they are also hierarchical, 
so that within a cpuset, nodes can be further divided as a means of 
resource management.  With your extensions, this could potentially include 
hugepage management in addition to strict memory isolation.

Manipulating hugepages via a nodemask seems less ideal than, as you 
mentioned, per-node hugepage controls, probably via 
/sys/devices/system/node/node*/nr_hugepages.  This type of interface 
provides all the functionality that this patchset does, including hugepage 
allocation and freeing, but with more power to explicitly allocate and 
free on targeted nodes.  /proc/sys/vm/nr_hugepages would remain to 
round-robin the allocation (and freeing, with your patch 1/5 which I 
ack'd).

Such an interface would also automatically deal with all memory 
hotplug/remove issues without storing or keeping a nodemask updated.
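
To make that concrete (purely illustrative -- none of these per-node
attributes exist yet, so the exact path and attribute name are only a
guess), the admin could then do something like:

	# grow the default-sized pool on node 2 only (hypothetical attribute)
	echo 64 > /sys/devices/system/node/node2/nr_hugepages

	# and shrink it again without disturbing the other nodes
	echo 0 > /sys/devices/system/node/node2/nr_hugepages

while 'echo N > /proc/sys/vm/nr_hugepages' keeps its current
round-robin-over-all-nodes behavior.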


* Re: [PATCH 1/5] Free huge pages round robin to balance across nodes
  2009-06-17 17:16     ` Lee Schermerhorn
@ 2009-06-18 19:08       ` David Rientjes
  0 siblings, 0 replies; 31+ messages in thread
From: David Rientjes @ 2009-06-18 19:08 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, Andrew Morton, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-18 19:08     ` David Rientjes
@ 2009-06-24  7:11       ` David Rientjes
  2009-06-24 11:25         ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: David Rientjes @ 2009-06-24  7:11 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, Andrew Morton, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney, Ranjit Manomohan

On Thu, 18 Jun 2009, David Rientjes wrote:

> Manipulating hugepages via a nodemask seems less ideal than, as you 
> mentioned, per-node hugepage controls, probably via 
> /sys/devices/system/node/node*/nr_hugepages.  This type of interface 
> provides all the functionality that this patchset does, including hugepage 
> allocation and freeing, but with more power to explicitly allocate and 
> free on targeted nodes.  /proc/sys/vm/nr_hugepages would remain to 
> round-robin the allocation (and freeing, with your patch 1/5 which I 
> ack'd).
> 
> Such an interface would also automatically deal with all memory 
> hotplug/remove issues without storing or keeping a nodemask updated.
> 

Expanding this proposal out a little bit, we'd want all the power of the 
/sys/kernel/mm/hugepages tunables for each node.  The best way of doing 
that is probably to keep the current /sys/kernel/mm/hugepages directory as 
is (already published Documentation/ABI/testing/sysfs-kernel-mm-hugepages) 
for the system-wide hugepage state and then add individual 
`hugepages-<size>kB' directories to each /sys/devices/system/node/node* to 
target allocations and freeing for the per-node hugepage state.  
Otherwise, we lack node targeted support for multiple hugepagesz= users.
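
For example (a hypothetical layout, just to illustrate the proposal --
nothing like this is in the tree yet):

	/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages		# system-wide, as today
	/sys/devices/system/node/node0/hugepages-2048kB/nr_hugepages
	/sys/devices/system/node/node0/hugepages-1048576kB/nr_hugepages

	# e.g. allocate four 1GB pages on node 0 only
	echo 4 > /sys/devices/system/node/node0/hugepages-1048576kB/nr_hugepages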


* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-24  7:11       ` David Rientjes
@ 2009-06-24 11:25         ` Lee Schermerhorn
  2009-06-24 22:26           ` David Rientjes
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-24 11:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-mm, Andrew Morton, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney, Ranjit Manomohan

On Wed, 2009-06-24 at 00:11 -0700, David Rientjes wrote:
> On Thu, 18 Jun 2009, David Rientjes wrote:
> 
> > Manipulating hugepages via a nodemask seems less ideal than, as you 
> > mentioned, per-node hugepage controls, probably via 
> > /sys/devices/system/node/node*/nr_hugepages.  This type of interface 
> > provides all the functionality that this patchset does, including hugepage 
> > allocation and freeing, but with more power to explicitly allocate and 
> > free on targeted nodes.  /proc/sys/vm/nr_hugepages would remain to 
> > round-robin the allocation (and freeing, with your patch 1/5 which I 
> > ack'd).
> > 
> > Such an interface would also automatically deal with all memory 
> > hotplug/remove issues without storing or keeping a nodemask updated.
> > 
> 
> Expanding this proposal out a little bit, we'd want all the power of the 
> /sys/kernel/mm/hugepages tunables for each node.  The best way of doing 
> that is probably to keep the current /sys/kernel/mm/hugepages directory as 
> is (already published Documentation/ABI/testing/sysfs-kernel-mm-hugepages) 
> for the system-wide hugepage state and then add individual 
> `hugepages-<size>kB' directories to each /sys/devices/system/node/node* to 
> target allocations and freeing for the per-node hugepage state.  
> Otherwise, we lack node targeted support for multiple hugepagesz= users.

David:

Nish mentioned this to me a while back when I asked about his patches.
That's one of my reasons for seeing if the simpler [IMO] nodes_allowed
would be sufficient.  I'm currently updating the nodes_allowed series
per Mel's cleanup suggestions.  I'll then prototype Mel's preferred
method of using the task's mempolicy.  I still have reservations about
this:  static huge page allocation is currently not constrained by
policy or cpusets, and I can't tell whether the task's mempolicy was
set explicitly to constrain the huge pages or just inherited from the
parent shell.

Next I'll also dust off Nish's old per node hugetlb control patches and
see what it takes to update them for the multiple sizes.  It will look
pretty much as you suggest.  Do you have any suggestions for a boot
command line syntax to specify per node huge page counts at boot time
[assuming we still want this]?  Currently, for default huge page size,
distributed across nodes, we have:

	hugepages=<N>

I was thinking something like:

	hugepages=(node:count,...)

using the '(' as a flag for per node counts, w/o needing to prescan for
':'

Thoughts?

Lee


* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-24 11:25         ` Lee Schermerhorn
@ 2009-06-24 22:26           ` David Rientjes
  2009-06-25  2:14             ` Lee Schermerhorn
  0 siblings, 1 reply; 31+ messages in thread
From: David Rientjes @ 2009-06-24 22:26 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, Andrew Morton, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney, Ranjit Manomohan

On Wed, 24 Jun 2009, Lee Schermerhorn wrote:

> David:
> 
> Nish mentioned this to me a while back when I asked about his patches.
> That's one of my reasons for seeing if the simpler [IMO] nodes_allowed
> would be sufficient.  I'm currently updating the nodes_allowed series
> per Mel's cleanup suggestions.

The /proc/sys/vm/hugepages_nodes_allowed support is troublesome because 
it's global and can race with other writers, such as for tasks in two 
disjoint cpusets attempting to allocate hugepages concurrently.

> I'll then prototype Mel's preferred
> method of using the task's mempolicy.

This proposal eliminates the aforementioned race, but now has the opposite 
problem: if a single task is allocating hugepages for multiple cpusets, it 
must setup the correct mempolicies to allocate (or free) for the new 
cpuset mems.
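
(For illustration only -- nothing is implemented yet -- I'd expect usage
under the mempolicy approach to look roughly like:

	# grow the pool only on nodes 0 and 1; the write is performed by
	# 'echo', which runs under the interleave policy set by numactl
	numactl --interleave=0,1 echo 40 > /proc/sys/vm/nr_hugepages

so the admin would have to wrap every pool adjustment in the right
policy, once per cpuset.)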

> I still have reservations about
> this:  static huge page allocation is currently not constrained by
> policy or cpusets, and I can't tell whether the task's mempolicy was
> set explicitly to constrain the huge pages or just inherited from the
> parent shell.
> 

Agreed.  I do think that we used to constrain hugepage allocations by 
cpuset in the past, though, when the allocation was simply done via 
alloc_pages_node().  We'd fall back to using nodes outside of 
cpuset_current_mems_allowed when the nodes were too full or fragmented.

> Next I'll also dust off Nish's old per node hugetlb control patches and
> see what it takes to update them for the multiple sizes.  It will look
> pretty much as you suggest.  Do you have any suggestions for a boot
> command line syntax to specify per node huge page counts at boot time
> [assuming we still want this]?  Currently, for default huge page size,
> distributed across nodes, we have:
> 
> 	hugepages=<N>
> 
> I was thinking something like:
> 
> 	hugepages=(node:count,...)
> 
> using the '(' as a flag for per node counts, w/o needing to prescan for
> ':'
> 

The hugepages=(node:count,...) option would still need to be interleaved 
with hugepagesz= for the various sizes.  This could become pretty cryptic:

	hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G 	\
		hugepages=(2:10,3:10)

and I assume we'd use `count' of 99999 for nodes of unknown sizes where we 
simply want to allocate as many hugepages as possible.

We'd still need to support hugepages=N for large NUMA machines so we don't 
have to specify the same number of hugepages per node for a true 
interleave, which would require an extremely large command line.  And then 
the behavior of

	hugepagesz=1G hugepages=(0:10,1:20) hugepages=30

needs to be defined.  In that case, does hugepages=30 override the 
previous settings if this system only has dual nodes?  If so, for SGI's 1K 
node systems it's going to be difficult to specify many nodes with 10 
hugepages and a few with 20.  So perhaps hugepages=(node:count,...) should 
increment or decrement the hugepages= value, if specified?


* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-24 22:26           ` David Rientjes
@ 2009-06-25  2:14             ` Lee Schermerhorn
  2009-06-25 19:22               ` David Rientjes
  0 siblings, 1 reply; 31+ messages in thread
From: Lee Schermerhorn @ 2009-06-25  2:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-mm, Andrew Morton, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney, Ranjit Manomohan

On Wed, 2009-06-24 at 15:26 -0700, David Rientjes wrote:
> On Wed, 24 Jun 2009, Lee Schermerhorn wrote:
> 
> > David:
> > 
> > Nish mentioned this to me a while back when I asked about his patches.
> > That's one of my reasons for seeing if the simpler [IMO] nodes_allowed
> > would be sufficient.  I'm currently updating the nodes_allowed series
> > per Mel's cleanup suggestions.
> 
> The /proc/sys/vm/hugepages_nodes_allowed support is troublesome because 
> it's global and can race with other writers, such as for tasks in two 
> disjoint cpusets attempting to allocate hugepages concurrently.

Agreed.  If one has multiple administrators or privileged tasks trying
to modify the huge page pool simultaneously, this is problematic.  For
a "single administrator" system, it might be workable.  And it seemed a
fairly minimal change at a time when I didn't see much interest in this
area in the community.

> 
> > I'll then prototype Mel's preferred
> > method of using the task's mempolicy.
> 
> This proposal eliminates the aforementioned race, but now has the opposite 
> problem: if a single task is allocating hugepages for multiple cpusets, it 
> must setup the correct mempolicies to allocate (or free) for the new 
> cpuset mems.

Agreed.  To support this proposal, we'll need to construct a "nodes
allowed" mask from the policy [and I don't think we want default policy
to mean "local" in this case--big change in behavior!] and pass that to
the allocation functions.  Racing allocators can then use different
masks.

> 
> > I still have reservations about
> > this:  static huge page allocation is currently not constrained by
> > policy or cpusets, and I can't tell whether the task's mempolicy was
> > set explicitly to constrain the huge pages or just inherited from the
> > parent shell.
> > 
> 
> Agreed.  I do think that we used to constrain hugepage allocations by 
> cpuset in the past, though, when the allocation was simply done via 
> alloc_pages_node().  We'd fall back to using nodes outside of 
> cpuset_current_mems_allowed when the nodes were too full or fragmented.

I do recall a discussion about huge page allocation being constrained by
cpusets or not.  I don't recall whether they were and this was changed,
or they weren't [as is currently the case] and someone [Nish?] proposed
to change that.  I'd need to search for that exchange.  

Would having cpusets constrain huge page pool allocation meet your
needs?

> 
> > Next I'll also dust off Nish's old per node hugetlb control patches and
> > see what it takes to update them for the multiple sizes.  It will look
> > pretty much as you suggest.  Do you have any suggestions for a boot
> > command line syntax to specify per node huge page counts at boot time
> > [assuming we still want this]?  Currently, for default huge page size,
> > distributed across nodes, we have:
> > 
> > 	hugepages=<N>
> > 
> > I was thinking something like:
> > 
> > 	hugepages=(node:count,...)
> > 
> > using the '(' as a flag for per node counts, w/o needing to prescan for
> > ':'
> > 
> 
> The hugepages=(node:count,...) option would still need to be interleaved 
> with hugepagesz= for the various sizes.  

Well, yes.  I'd assumed that.  


> This could become pretty cryptic:
> 
> 	hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G 	\
> 		hugepages=(2:10,3:10)
> 
> and I assume we'd use `count' of 99999 for nodes of unknown sizes where we 
> simply want to allocate as many hugepages as possible.

If one needed that capability--"allocate as many as possible"--then,
yes, I guess any ridiculously large count would do the trick.

> 
> We'd still need to support hugepages=N for large NUMA machines so we don't 
> have to specify the same number of hugepages per node for a true 
> interleave, which would require an extremely large command line.  And then 
> the behavior of
> 
> 	hugepagesz=1G hugepages=(0:10,1:20) hugepages=30
> 
> needs to be defined.  In that case, does hugepages=30 override the 
> previous settings if this system only has dual nodes?  If so, for SGI's 1K 
> node systems it's going to be difficult to specify many nodes with 10 
> hugepages and a few with 20.  So perhaps hugepages=(node:count,...) should 
> increment or decrement the hugepages= value, if specified?

Mel mentioned that we probably don't need boot command line hugepage
allocation all that much with lumpy reclaim, etc.  I can see his point.
If we can't allocate all the hugepages we need from an early init script
or similar, we probably don't have enough memory anyway.  For
compatibility, I suppose we need to retain the hugepages= parameter.
And, we've added the hugepagesz parameter, so we need to retain that.
But, maybe we should initially limit per node allocations to sysfs node
attributes post boot?

-------------
Related question:  do you think we need per node overcommit limits?  I'm
having difficulty understanding what the semantics of the global limit
would be with per node limits--i.e., how would one distribute the global
limit across nodes [for backwards compatibility].  With nr_hugepages,
today we just do a best effort to distribute the requested number of
pages over the on-line nodes.  If we fail to allocate that many, we
don't remember the initial request, just how many we actually allocated
where ever they landed.  But, I don't see how that works with limits.  I
suppose we could arrange that if you don't specify a per node limit, the
global limit applies when attempting to allocate a surplus page on a
given node.  If you do [so specify], then the respective node limit
applies, whether or not the sum of per node surplus pages exceeds the
global limit.
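
For concreteness, what I have in mind (the per node attribute below is
entirely made up -- only the global /proc/sys/vm/nr_overcommit_hugepages
sysctl exists today) is something like:

	# global surplus limit, as today
	echo 64 > /proc/sys/vm/nr_overcommit_hugepages

	# hypothetical per node override; 0 [the default] would mean
	# "fall back to the global limit"
	echo 16 > /sys/devices/system/node/node3/hugepages-2048kB/nr_overcommit_hugepages

so a surplus allocation on node 3 would check the node limit, while
allocations on nodes with no override would check the global one.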

Thoughts?

Lee


* Re: [PATCH 0/5] Huge Pages Nodes Allowed
  2009-06-25  2:14             ` Lee Schermerhorn
@ 2009-06-25 19:22               ` David Rientjes
  0 siblings, 0 replies; 31+ messages in thread
From: David Rientjes @ 2009-06-25 19:22 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, Andrew Morton, Nishanth Aravamudan,
	Adam Litke, Andy Whitcroft, eric.whitney, Ranjit Manomohan

On Wed, 24 Jun 2009, Lee Schermerhorn wrote:

> Would having cpusets constrain huge page pool allocation meet your
> needs?
> 

It would, but it seems like an unnecessary inconvenience.  It would 
require the admin task to join a cpuset just to allocate hugepages for an 
application.  It also makes it more difficult to expand a cpuset to 
include a new node with a specific threshold of hugepages available if 
writes to /sys/devices/system/node/node*/hugepages-<size>kB/nr_hugepages 
are used to preallocate hugepages and discover which node has the least 
fragmentation to support such an allocation in the first place (freeing 
the pages again if they cannot be allocated).

It also doesn't support users who don't have CONFIG_CPUSETS but do 
mbind their memory to a subset of nodes; those nodes need hugepages 
while the others do not.
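
For instance (illustrative only; the program name is made up), a job
might do nothing more than:

	# bind the job's memory to nodes 2 and 3 -- no cpuset involved
	numactl --membind=2,3 ./hugepage_app

and still want the hugepage pool grown only on nodes 2 and 3.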

> > This could become pretty cryptic:
> > 
> > 	hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G 	\
> > 		hugepages=(2:10,3:10)
> > 
> > and I assume we'd use `count' of 99999 for nodes of unknown sizes where we 
> > simply want to allocate as many hugepages as possible.
> 
> If one needed that capability--"allocate as many as possible"--then,
> yes, I guess any ridiculously large count would do the trick.
> 

I remember that on older kernels large hugepages= values would cause the 
system not to boot because subsequent kernel allocations would fail once 
memory was exhausted.  We need to do hugepages= allocation as early as 
possible to avoid additional fragmentation later, but not so much that we 
completely oom the kernel.  The only way to prevent that is with a 
maximum hugepage watermark (which would no longer be global but per zone 
with the hugepages=(node:count,...) support) to allow at least a certain 
threshold of memory to remain free to the kernel for boot.

> > We'd still need to support hugepages=N for large NUMA machines so we don't 
> > have to specify the same number of hugepages per node for a true 
> > interleave, which would require an extremely large command line.  And then 
> > the behavior of
> > 
> > 	hugepagesz=1G hugepages=(0:10,1:20) hugepages=30
> > 
> > needs to be defined.  In that case, does hugepages=30 override the 
> > previous settings if this system only has dual nodes?  If so, for SGI's 1K 
> > node systems it's going to be difficult to specify many nodes with 10 
> > hugepages and a few with 20.  So perhaps hugepages=(node:count,...) should 
> > increment or decrement the hugepages= value, if specified?
> 
> Mel mentioned that we probably don't need boot command line hugepage
> allocation all that much with lumpy reclaim, etc.  I can see his point.

Understood, especially with the complexity of specifying them on the 
command line in the first place :)  The only concern I have is for users 
who want hugepages preallocated on a specific set of nodes at boot and are 
required to use much higher hugepages= values to allocate on all system 
nodes and then just free the pages on the nodes they aren't interested in.

> If we can't allocate all the hugepages we need from an early init script
> or similar, we probably don't have enough memory anyway.  For
> compatibility, I suppose we need to retain the hugepages= parameter.
> And, we've added the hugepagesz parameter, so we need to retain that.
> But, maybe we should initially limit per node allocations to sysfs node
> attributes post boot?
> 

Agreed, it solves the early boot oom failures as well.

> -------------
> Related question:  do you think we need per node overcommit limits?

I do, because applications constrained to an exclusive cpuset will only be 
able to allocate from their set of allowable nodes anyway, so the global 
overcommit limits aren't in effect.  There needs to be a mechanism to 
allow such allocations to take place for such constrained tasks.

The only reason I've proposed these hugepage tunables to be attributes of 
the system's nodes and not attributes of the individual cpusets is that 
non-exclusive cpusets may share nodes among siblings and parents will 
always share nodes with children, so the tunables could become 
inconsistent with one another.  For all other purposes, they really are a 
characteristic of the cpuset's job, however.

> I'm
> having difficulty understanding what the semantics of the global limit
> would be with per node limits--i.e., how would one distribute the global
> limit across nodes [for backwards compatibility].  With nr_hugepages,
> today we just do a best effort to distribute the requested number of
> pages over the on-line nodes.  If we fail to allocate that many, we
> don't remember the initial request, just how many we actually allocated
> where ever they landed.  But, I don't see how that works with limits.  I
> suppose we could arrange that if you don't specify a per node limit, the
> global limit applies when attempting to allocate a surplus page on a
> given node.  If you do [so specify], then the respective node limit
> applies, whether or not the sum of per node surplus pages exceeds the
> global limit.
> 

Hmm, yes, that's a concern.  I think the easiest way to do it would be to 
respect both the global and node surplus limits when the node limit is 0, 
and respect the node surplus limit when it is positive.  The setting of 
either the global limit or the node limit does not change the other.  This 
deals with the cpuset case where applications constrained to a cpuset may 
only allocate from their own nodes.

This would allow the global limit to exceed the sum of the node limits and 
for hugepage allocation to fail, but that would have required a node limit 
to be set on each online node.  In such a situation, the global limit has, 
in effect, been obsoleted and its value no longer matters.


end of thread, other threads:[~2009-06-25 19:21 UTC | newest]

Thread overview: 31+ messages
2009-06-16 13:52 [PATCH 0/5] Huge Pages Nodes Allowed Lee Schermerhorn
2009-06-16 13:52 ` [PATCH 1/5] Free huge pages round robin to balance across nodes Lee Schermerhorn
2009-06-17 13:18   ` Mel Gorman
2009-06-17 17:16     ` Lee Schermerhorn
2009-06-18 19:08       ` David Rientjes
2009-06-16 13:52 ` [PATCH 2/5] Add nodes_allowed members to hugepages hstate struct Lee Schermerhorn
2009-06-17 13:35   ` Mel Gorman
2009-06-17 17:38     ` Lee Schermerhorn
2009-06-18  9:17       ` Mel Gorman
2009-06-16 13:53 ` [PATCH 3/5] Use per hstate nodes_allowed to constrain huge page allocation Lee Schermerhorn
2009-06-17 13:39   ` Mel Gorman
2009-06-17 17:47     ` Lee Schermerhorn
2009-06-18  9:18       ` Mel Gorman
2009-06-16 13:53 ` [PATCH 4/5] Add sysctl for default hstate nodes_allowed Lee Schermerhorn
2009-06-17 13:41   ` Mel Gorman
2009-06-17 17:52     ` Lee Schermerhorn
2009-06-18  9:19       ` Mel Gorman
2009-06-16 13:53 ` [PATCH 5/5] Update huge pages kernel documentation Lee Schermerhorn
2009-06-18 18:49   ` David Rientjes
2009-06-18 19:06     ` Lee Schermerhorn
2009-06-17 13:02 ` [PATCH 0/5] Huge Pages Nodes Allowed Mel Gorman
2009-06-17 17:15   ` Lee Schermerhorn
2009-06-18  9:33     ` Mel Gorman
2009-06-18 14:46       ` Lee Schermerhorn
2009-06-18 15:00         ` Mel Gorman
2009-06-18 19:08     ` David Rientjes
2009-06-24  7:11       ` David Rientjes
2009-06-24 11:25         ` Lee Schermerhorn
2009-06-24 22:26           ` David Rientjes
2009-06-25  2:14             ` Lee Schermerhorn
2009-06-25 19:22               ` David Rientjes
