linux-mm.kvack.org archive mirror
* [PATCH] Fix hugetlb pool allocation with empty nodes
@ 2007-05-03  2:21 Anton Blanchard
  2007-05-03  3:02 ` Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Anton Blanchard @ 2007-05-03  2:21 UTC (permalink / raw)
  To: linux-mm, clameter, ak; +Cc: nish.aravamudan, mel, apw

An interesting bug was pointed out to me where we failed to allocate
hugepages evenly. In the example below node 7 has no memory (it only has
CPUs). Node 0 and 1 have plenty of free memory. After doing:

# echo 16 > /proc/sys/vm/nr_hugepages

We see the imbalance:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     6
Node 1 HugePages_Total:     10
Node 7 HugePages_Total:     0

It didn't take long to realise that alloc_fresh_huge_page is allocating
from node 7 without GFP_THISNODE set, so we fall back to its next
preferred node (i.e. 1). This means we end up with a 1/3 2/3 imbalance.

After fixing this it still didn't work, and after some more poking I see
why. When building our fallback zonelist in build_zonelists_node we
skip empty zones. This means node 7's zonelist never registers node 7's
empty zones and instead registers node 1's. Therefore when we ask for a
page from node 7 using the GFP_THISNODE flag, we end up with node 1
memory.

By removing the populated_zone() check in build_zonelists_node we fix
the problem:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     8
Node 1 HugePages_Total:     8
Node 7 HugePages_Total:     0

I'm guessing registering empty remote zones might make the SGI guys a bit
unhappy; maybe we should just force the registration of empty local
zones? Does anyone care?

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: kernel/mm/hugetlb.c
===================================================================
--- kernel.orig/mm/hugetlb.c	2007-05-02 20:46:03.000000000 -0500
+++ kernel/mm/hugetlb.c	2007-05-02 20:48:15.000000000 -0500
@@ -103,11 +103,18 @@ static int alloc_fresh_huge_page(void)
 {
 	static int nid = 0;
 	struct page *page;
-	page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
+	int start_nid = nid;
+
+	do {
+		page = alloc_pages_node(nid,
+					GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
 					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+
+		nid = next_node(nid, node_online_map);
+		if (nid == MAX_NUMNODES)
+			nid = first_node(node_online_map);
+	} while (!page && nid != start_nid);
+
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
Index: kernel/mm/page_alloc.c
===================================================================
--- kernel.orig/mm/page_alloc.c	2007-05-02 20:46:03.000000000 -0500
+++ kernel/mm/page_alloc.c	2007-05-02 20:47:59.000000000 -0500
@@ -1659,10 +1659,8 @@ static int __meminit build_zonelists_nod
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (populated_zone(zone)) {
-			zonelist->zones[nr_zones++] = zone;
-			check_highest_zone(zone_type);
-		}
+		zonelist->zones[nr_zones++] = zone;
+		check_highest_zone(zone_type);
 
 	} while (zone_type);
 	return nr_zones;


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes
  2007-05-03  2:21 [PATCH] Fix hugetlb pool allocation with empty nodes Anton Blanchard
@ 2007-05-03  3:02 ` Christoph Lameter
  2007-05-03  6:07   ` Anton Blanchard
  2007-05-03  8:59 ` Andi Kleen
  2007-05-04 20:29 ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 Lee Schermerhorn
  2 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2007-05-03  3:02 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-mm, ak, nish.aravamudan, mel, apw

On Wed, 2 May 2007, Anton Blanchard wrote:

> It didnt take long to realise that alloc_fresh_huge_page is allocating
> from node 7 without GFP_THISNODE set, so we fallback to its next
> preferred node (ie 1). This means we end up with a 1/3 2/3 imbalance.

Yup.
 
> After fixing this it still didnt work, and after some more poking I see
> why. When building our fallback zonelist in build_zonelists_node we
> skip empty zones. This means zone 7 never registers node 7's empty
> zonelists and instead registers node 1's. Therefore when we ask for a
> page from node 7, using the GFP_THISNODE flag we end up with node 1
> memory.
> 
> By removing the populated_zone() check in build_zonelists_node we fix
> the problem:

Looks good. I guess that is possible now that memory policy
zonelist building skips empty zonelists. Andi?

> Im guessing registering empty remote zones might make the SGI guys a bit
> unhappy, maybe we should just force the registration of empty local
> zones? Does anyone care?

Why would that make us unhappy?

Note that this is a direct result of allowing nodes without memory. We
only recently allowed such things while being aware that there would be
some breakage. This is one example. If the empty node had not been marked
online then we would not have attempted an allocation there.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes
  2007-05-03  3:02 ` Christoph Lameter
@ 2007-05-03  6:07   ` Anton Blanchard
  2007-05-03  6:37     ` Christoph Lameter
  0 siblings, 1 reply; 27+ messages in thread
From: Anton Blanchard @ 2007-05-03  6:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, ak, nish.aravamudan, mel, apw

 
Hi,

> > Im guessing registering empty remote zones might make the SGI guys a bit
> > unhappy, maybe we should just force the registration of empty local
> > zones? Does anyone care?
> 
> Why would that make us unhappy?

Since SGI boxes can have lots of NUMA nodes, I was worried the patch
might negatively affect you. It sounds like that's not so much of an
issue.

Anton


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes
  2007-05-03  6:07   ` Anton Blanchard
@ 2007-05-03  6:37     ` Christoph Lameter
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2007-05-03  6:37 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-mm, ak, nish.aravamudan, mel, apw

On Thu, 3 May 2007, Anton Blanchard wrote:

> > Why would that make us unhappy?
> 
> Since SGI boxes can have lots of NUMA nodes I was worried the patch
> might negatively affect you. It sounds like thats not so much of an
> issue.

SGI Altix systems only have a single zone on each node. Thus the zone 
fallback crap does not happen. And they are symmetric.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes
  2007-05-03  2:21 [PATCH] Fix hugetlb pool allocation with empty nodes Anton Blanchard
  2007-05-03  3:02 ` Christoph Lameter
@ 2007-05-03  8:59 ` Andi Kleen
  2007-05-03 13:22   ` Anton Blanchard
  2007-05-04 20:29 ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 Lee Schermerhorn
  2 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2007-05-03  8:59 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-mm, clameter, nish.aravamudan, mel, apw

> Im guessing registering empty remote zones might make the SGI guys a bit
> unhappy, maybe we should just force the registration of empty local
> zones? Does anyone care?

I care. Don't do that please. Empty nodes cause all kinds of problems.

-Andi


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes
  2007-05-03  8:59 ` Andi Kleen
@ 2007-05-03 13:22   ` Anton Blanchard
  0 siblings, 0 replies; 27+ messages in thread
From: Anton Blanchard @ 2007-05-03 13:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, clameter, nish.aravamudan, mel, apw

Hi,
 
> > Im guessing registering empty remote zones might make the SGI guys a bit
> > unhappy, maybe we should just force the registration of empty local
> > zones? Does anyone care?
> 
> I care. Don't do that please. Empty nodes cause all kinds of problems.

Could you be more specific? How else do we fix the problem I just
identified?

Anton


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2
  2007-05-03  2:21 [PATCH] Fix hugetlb pool allocation with empty nodes Anton Blanchard
  2007-05-03  3:02 ` Christoph Lameter
  2007-05-03  8:59 ` Andi Kleen
@ 2007-05-04 20:29 ` Lee Schermerhorn
  2007-05-04 21:27   ` Christoph Lameter
  2 siblings, 1 reply; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-04 20:29 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: linux-mm, clameter, ak, nish.aravamudan, mel, apw,
	Christoph Lameter, Andrew Morton, Eric Whitney

On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> An interesting bug was pointed out to me where we failed to allocate
> hugepages evenly. In the example below node 7 has no memory (it only has
> CPUs). Node 0 and 1 have plenty of free memory. After doing:

Here's my attempt to fix the problem [I see it on HP platforms as well],
without removing the population check in build_zonelists_node().  Seems
to work.

[Because I had to rebase the patch to 21-rc7-mm2 where I'm working, I
just refreshed the entire patch, instead of creating an incremental
patch on top of Anton's.]

---------------------------------------------------------------------

[PATCH] Fix hugetlb pool allocation with empty nodes V2

Against 2.6.21-rc7-mm2

Changes V1 [Anton]  -> V2 [Lee]:

1) reverted Anton's removal of the populated_zone() check in
   build_zonelists_node, to keep empty zones out of the allocation
   zonelists.

2) added a populated_zone() check to alloc_fresh_huge_page().  Skip
   nodes whose zone corresponding to GFP_HIGHUSER is empty.

--------
Original description:

An interesting bug was pointed out to me where we failed to allocate
hugepages evenly. In the example below node 7 has no memory (it only has
CPUs). Node 0 and 1 have plenty of free memory. After doing:

# echo 16 > /proc/sys/vm/nr_hugepages

We see the imbalance:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     6
Node 1 HugePages_Total:     10
Node 7 HugePages_Total:     0

It didn't take long to realise that alloc_fresh_huge_page is allocating
from node 7 without GFP_THISNODE set, so we fall back to its next
preferred node (i.e. 1). This means we end up with a 1/3 2/3 imbalance.

After fixing this it still didn't work, and after some more poking I see
why. When building our fallback zonelist in build_zonelists_node we
skip empty zones. This means node 7's zonelist never registers node 7's
empty zones and instead registers node 1's. Therefore when we ask for a
page from node 7 using the GFP_THISNODE flag, we end up with node 1
memory.

<snip bit about removing pop check from build_zonelists_node...>

Add a zone population check to alloc_fresh_huge_page() and skip nodes
with an unpopulated zone.

V2 testing:

Tested on 4-node, 32GB HP NUMA platform with funky 512MB pseudo-zone
for hardware interleaved memory.  The pseudo-zone contains only ZONE_DMA
memory.  Without this patch, after "echo 64 >/proc/sys/vm/nr_hugepages",
"cat /sys/devices/system/node/node*/meminfo | grep HugeP" would yield:

Node 0 HugePages_Total:    25
Node 0 HugePages_Free:     25
Node 1 HugePages_Total:    13
Node 1 HugePages_Free:     13
Node 2 HugePages_Total:    13
Node 2 HugePages_Free:     13
Node 3 HugePages_Total:    13
Node 3 HugePages_Free:     13
Node 4 HugePages_Total:     0
Node 4 HugePages_Free:      0

With patch:

Node 0 HugePages_Total:    16
Node 0 HugePages_Free:     16
Node 1 HugePages_Total:    16
Node 1 HugePages_Free:     16
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 3 HugePages_Total:    16
Node 3 HugePages_Free:     16
Node 4 HugePages_Total:     0
Node 4 HugePages_Free:      0


Originally
Signed-off-by: Anton Blanchard <anton@samba.org>

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c	2007-05-04 15:41:10.000000000 -0400
+++ Linux/mm/hugetlb.c	2007-05-04 15:48:22.000000000 -0400
@@ -107,11 +107,24 @@ static int alloc_fresh_huge_page(void)
 {
 	static int nid = 0;
 	struct page *page;
-	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+	int start_nid = nid;
+
+	do {
+		pg_data_t *pgdat =  NODE_DATA(nid);
+		struct zone *zone = pgdat->node_zones + gfp_zone(GFP_HIGHUSER);
+
+		/*
+		 * accept only nodes with populated "HIGHUSER" zone
+		 */
+		if (populated_zone(zone))
+			page = alloc_pages_node(nid,
+					GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
+  					HUGETLB_PAGE_ORDER);
+
+		nid = next_node(nid, node_online_map);
+		if (nid == MAX_NUMNODES)
+			nid = first_node(node_online_map);
+	} while (!page && nid != start_nid);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2
  2007-05-04 20:29 ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 Lee Schermerhorn
@ 2007-05-04 21:27   ` Christoph Lameter
  2007-05-04 22:39     ` Nish Aravamudan
                       ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Christoph Lameter @ 2007-05-04 21:27 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Anton Blanchard, linux-mm, ak, nish.aravamudan, mel, apw,
	Andrew Morton, Eric Whitney

On Fri, 4 May 2007, Lee Schermerhorn wrote:

> On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > An interesting bug was pointed out to me where we failed to allocate
> > hugepages evenly. In the example below node 7 has no memory (it only has
> > CPUs). Node 0 and 1 have plenty of free memory. After doing:
> 
> Here's my attempt to fix the problem [I see it on HP platforms as well],
> without removing the population check in build_zonelists_node().  Seems
> to work.

I think we need something like for_each_online_node, but for each node
with memory; otherwise we are going to replicate this all over the place
for memoryless nodes. Add a nodemap for populated nodes?

I.e.

for_each_mem_node?

Then you do not have to check the zone flags all the time. May avoid a lot 
of mess?
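
Something along these lines, perhaps (just a sketch, none of these
names exist yet):

	/* hypothetical mask of nodes that actually have memory */
	extern nodemask_t node_memory_map;

	#define node_has_memory(node)	node_isset((node), node_memory_map)
	#define for_each_mem_node(node) \
		for_each_node_mask((node), node_memory_map)

page_alloc.c could set the bits where it already detects populated
zones, and callers like the hugetlb pool allocator would then only ever
see nodes with memory.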


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2
  2007-05-04 21:27   ` Christoph Lameter
@ 2007-05-04 22:39     ` Nish Aravamudan
  2007-05-07 13:40     ` Lee Schermerhorn
  2007-05-09 16:37     ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3 Lee Schermerhorn
  2 siblings, 0 replies; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-04 22:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney

On 5/4/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 4 May 2007, Lee Schermerhorn wrote:
>
> > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > An interesting bug was pointed out to me where we failed to allocate
> > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > CPUs). Node 0 and 1 have plenty of free memory. After doing:
> >
> > Here's my attempt to fix the problem [I see it on HP platforms as well],
> > without removing the population check in build_zonelists_node().  Seems
> > to work.
>
> I think we need something like for_each_online_node for each node with
> memory otherwise we are going to replicate this all over the place for
> memoryless nodes. Add a nodemap for populated nodes?
>
> I.e.
>
> for_each_mem_node?
>
> Then you do not have to check the zone flags all the time. May avoid a lot
> of mess?

I agree -- and we'd keep hugetlb.c relatively node-unaware. hugetlb.c
would only need the nodemap, I believe, and we could just change

               nid = next_node(nid, node_online_map);
               if (nid == MAX_NUMNODES)
                       nid = first_node(node_online_map);

to use mem_node_map or whatever it would be called. (node_mem_map looks
weird to me; why is it node_online_map but for_each_online_node()?)

Thanks,
Nish


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2
  2007-05-04 21:27   ` Christoph Lameter
  2007-05-04 22:39     ` Nish Aravamudan
@ 2007-05-07 13:40     ` Lee Schermerhorn
  2007-05-09 16:37     ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3 Lee Schermerhorn
  2 siblings, 0 replies; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-07 13:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Anton Blanchard, linux-mm, ak, nish.aravamudan, mel, apw,
	Andrew Morton, Eric Whitney

On Fri, 2007-05-04 at 14:27 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Lee Schermerhorn wrote:
> 
> > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > An interesting bug was pointed out to me where we failed to allocate
> > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > CPUs). Node 0 and 1 have plenty of free memory. After doing:
> > 
> > Here's my attempt to fix the problem [I see it on HP platforms as well],
> > without removing the population check in build_zonelists_node().  Seems
> > to work.
> 
> I think we need something like for_each_online_node for each node with
> memory otherwise we are going to replicate this all over the place for 
> memoryless nodes. Add a nodemap for populated nodes?
> 
> I.e.
> 
> for_each_mem_node?
> 
> Then you do not have to check the zone flags all the time. May avoid a lot 
> of mess?

From a performance point of view, I don't think using the zone flags to
figure out which zone to look at should cause any noticeable overhead.
I hope no one is increasing [or decreasing] nr_hugepages all that often.
I would expect it to happen at boot time, or soon thereafter.  Much
later and you run the risk of not being able to allocate hugepages
because of fragmentation [Hi, Mel!].

We'll still need to iterate over such a mask multiple times until the
requested number of hugepages has been allocated.  Of course this, as
well as the current method, assumes that all nodes have approximately
the same amount of memory.  I've considered precalculating the number of
hugepages per node based on the amount of memory in each node, but this
would require that hugetlb.c have even more knowledge of the zones...
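
Roughly what I had in mind, for illustration only [target[] is a
made-up per-node array, and this ignores rounding and per-zone detail]:

	unsigned long total = 0, node_pages;
	int nid;

	/* total up memory present on all online nodes */
	for_each_online_node(nid)
		total += NODE_DATA(nid)->node_present_pages;

	/* carve nr_hugepages up in proportion to each node's size */
	for_each_online_node(nid) {
		node_pages = NODE_DATA(nid)->node_present_pages;
		target[nid] = max_huge_pages * node_pages / total;
	}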

Anyway, I hit the problem [imbalance in # of hugepages per node with
memory due to memoryless nodes] about the time that Anton posted his
fix.  I thought that adding 3 [non-commentary/non-whitespace] lines in a
non-performance path in order to avoid empty zones in the zonelists was
a good tradeoff.  Silly me ;-)!

Lee


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-04 21:27   ` Christoph Lameter
  2007-05-04 22:39     ` Nish Aravamudan
  2007-05-07 13:40     ` Lee Schermerhorn
@ 2007-05-09 16:37     ` Lee Schermerhorn
  2007-05-09 16:57       ` Christoph Lameter
                         ` (2 more replies)
  2 siblings, 3 replies; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-09 16:37 UTC (permalink / raw)
  To: Christoph Lameter, Anton Blanchard
  Cc: linux-mm, ak, nish.aravamudan, mel, apw, Andrew Morton, Eric Whitney

On Fri, 2007-05-04 at 14:27 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Lee Schermerhorn wrote:
> 
> > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > An interesting bug was pointed out to me where we failed to allocate
> > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > CPUs). Node 0 and 1 have plenty of free memory. After doing:
> > 
> > Here's my attempt to fix the problem [I see it on HP platforms as well],
> > without removing the population check in build_zonelists_node().  Seems
> > to work.
> 
> I think we need something like for_each_online_node for each node with
> memory otherwise we are going to replicate this all over the place for 
> memoryless nodes. Add a nodemap for populated nodes?
> 
> I.e.
> 
> for_each_mem_node?
> 
> Then you do not have to check the zone flags all the time. May avoid a lot 
> of mess?

OK, here's a rework that exports a node_populated_map and associated
access functions from page_alloc.c where we already check for populated
zones.  Maybe this should be "node_hugepages_map" ?  

Also, we might consider exporting this to user space for applications
that want to "interleave across all nodes with hugepages"--not that
hugetlbfs mappings currently obey "vma policy".  Could still be used
with the "set task policy before allocating region" method [not that I
advocate this method ;-)].

I don't think that a 'for_each_*_node()' macro is appropriate for this
usage, as alloc_fresh_huge_page() is an "incremental allocator" that
returns a page from the "next eligible node" on each call.

By the way:  does anything protect the "static int nid" in
alloc_fresh_huge_page() from racing attempts to set nr_hugepages?
Can this happen?  Do we care?

Again, I chose to rework Anton's original patch, maintaining his
rationale/discussion, rather than create a separate patch.  Note the "Rework"
comments therein--especially regarding NORMAL zone.  I expect we'll need
a few more rounds of "discussion" on this issue.  And, it'll require
rework to merge with the "change zonelist order" series that hits the
same area.

Lee

[PATCH] Fix hugetlb pool allocation with empty nodes - V3

Date:	Wed, 2 May 2007 21:21:07 -0500
From: Anton Blanchard <anton@samba.org>
To: linux-mm@kvack.org, clameter@SGI.com, ak@suse.de
Cc: nish.aravamudan@gmail.com, mel@csn.ul.ie, apw@shadowen.org
Subject: [PATCH] Fix hugetlb pool allocation with empty nodes

An interesting bug was pointed out to me where we failed to allocate
hugepages evenly. In the example below node 7 has no memory (it only has
CPUs). Node 0 and 1 have plenty of free memory. After doing:

# echo 16 > /proc/sys/vm/nr_hugepages

We see the imbalance:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     6
Node 1 HugePages_Total:     10
Node 7 HugePages_Total:     0

It didn't take long to realise that alloc_fresh_huge_page is allocating
from node 7 without GFP_THISNODE set, so we fall back to its next
preferred node (i.e. 1). This means we end up with a 1/3 2/3 imbalance.

After fixing this it still didn't work, and after some more poking I see
why. When building our fallback zonelist in build_zonelists_node we
skip empty zones. This means node 7's zonelist never registers node 7's
empty zones and instead registers node 1's. Therefore when we ask for a
page from node 7 using the GFP_THISNODE flag, we end up with node 1
memory.

By removing the populated_zone() check in build_zonelists_node we fix
the problem:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     8
Node 1 HugePages_Total:     8
Node 7 HugePages_Total:     0

I'm guessing registering empty remote zones might make the SGI guys a bit
unhappy; maybe we should just force the registration of empty local
zones? Does anyone care?

Rework:

Create node_populated_map and access functions [nodemask.h] to describe
nodes with populated gfp_zone(GFP_HIGHUSER).  Note that on x86, this
excludes nodes with only DMA* or NORMAL memory--i.e., no hugepages below
4G.  

Populate the map while building zonelists, where we already check for
populated zones.  This is early enough for boot time allocation of
huge pages.

Attempt to allocate "fresh" huge pages only from nodes in the populated
map.

Tested on ia64 numa platform with both boot time and run time allocation
of huge pages.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
 include/linux/nodemask.h |    9 +++++++++
 mm/hugetlb.c             |   20 +++++++++++++++-----
 mm/page_alloc.c          |   13 +++++++++++--
 3 files changed, 35 insertions(+), 7 deletions(-)

Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c	2007-05-08 10:27:15.000000000 -0400
+++ Linux/mm/hugetlb.c	2007-05-09 11:17:30.000000000 -0400
@@ -107,11 +107,21 @@ static int alloc_fresh_huge_page(void)
 {
 	static int nid = 0;
 	struct page *page;
-	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+	int start_nid = nid;
+
+	do {
+		/*
+		 * accept only nodes with populated "HIGHUSER" zone
+		 */
+		if (node_populated(nid))
+			page = alloc_pages_node(nid,
+					GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
+  					HUGETLB_PAGE_ORDER);
+
+		nid = next_node(nid, node_online_map);
+		if (nid == MAX_NUMNODES)
+			nid = first_node(node_online_map);
+	} while (!page && nid != start_nid);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-04-25 23:08:32.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-05-09 10:58:04.000000000 -0400
@@ -64,12 +64,16 @@
  *
  * int node_online(node)		Is some node online?
  * int node_possible(node)		Is some node possible?
+ * int node_populated(node)		Is some node populated [at 'HIGHUSER]
  *
  * int any_online_node(mask)		First online node in mask
  *
  * node_set_online(node)		set bit 'node' in node_online_map
  * node_set_offline(node)		clear bit 'node' in node_online_map
  *
+ * node_set_poplated(node)		set bit 'node' in node_populated_map
+ * node_not_poplated(node)		clear bit 'node' in node_populated_map
+ *
  * for_each_node(node)			for-loop node over node_possible_map
  * for_each_online_node(node)		for-loop node over node_online_map
  *
@@ -344,12 +348,14 @@ static inline void __nodes_remap(nodemas
 
 extern nodemask_t node_online_map;
 extern nodemask_t node_possible_map;
+extern nodemask_t node_populated_map;
 
 #if MAX_NUMNODES > 1
 #define num_online_nodes()	nodes_weight(node_online_map)
 #define num_possible_nodes()	nodes_weight(node_possible_map)
 #define node_online(node)	node_isset((node), node_online_map)
 #define node_possible(node)	node_isset((node), node_possible_map)
+#define node_populated(node)	node_isset((node), node_populated_map)
 #define first_online_node	first_node(node_online_map)
 #define next_online_node(nid)	next_node((nid), node_online_map)
 extern int nr_node_ids;
@@ -375,6 +381,9 @@ extern int nr_node_ids;
 #define node_set_online(node)	   set_bit((node), node_online_map.bits)
 #define node_set_offline(node)	   clear_bit((node), node_online_map.bits)
 
+#define node_set_populated(node)   set_bit((node), node_populated_map.bits)
+#define node_not_populated(node)   clear_bit((node), node_populated_map.bits)
+
 #define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
 #define for_each_online_node(node) for_each_node_mask((node), node_online_map)
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-05-08 11:47:45.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-05-09 11:16:27.000000000 -0400
@@ -54,6 +54,9 @@ nodemask_t node_online_map __read_mostly
 EXPORT_SYMBOL(node_online_map);
 nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
 EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_populated_map __read_mostly = NODE_MASK_NONE;
+EXPORT_SYMBOL(node_populated_map);
+
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
@@ -2021,11 +2024,14 @@ void show_free_areas(void)
  * Builds allocation fallback zone lists.
  *
  * Add all populated zones of a node to the zonelist.
+ * Record nodes with populated gfp_zone(GFP_HIGHUSER) for huge page allocation.
  */
 static int __meminit build_zonelists_node(pg_data_t *pgdat,
-			struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
+			struct zonelist *zonelist, int nr_zones,
+			enum zone_type zone_type)
 {
 	struct zone *zone;
+	enum zone_type zone_highuser = gfp_zone(GFP_HIGHUSER);
 
 	BUG_ON(zone_type >= MAX_NR_ZONES);
 	zone_type++;
@@ -2036,7 +2042,10 @@ static int __meminit build_zonelists_nod
 		if (populated_zone(zone)) {
 			zonelist->zones[nr_zones++] = zone;
 			check_highest_zone(zone_type);
-		}
+			if (zone_type == zone_highuser)
+				node_set_populated(pgdat->node_id);
+		} else if (zone_type == zone_highuser)
+			node_not_populated(pgdat->node_id);
 
 	} while (zone_type);
 	return nr_zones;



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 16:37     ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3 Lee Schermerhorn
@ 2007-05-09 16:57       ` Christoph Lameter
  2007-05-09 19:17         ` Lee Schermerhorn
  2007-05-09 19:59       ` Nish Aravamudan
  2007-05-16 19:59       ` Nish Aravamudan
  2 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2007-05-09 16:57 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Anton Blanchard, linux-mm, ak, nish.aravamudan, mel, apw,
	Andrew Morton, Eric Whitney

On Wed, 9 May 2007, Lee Schermerhorn wrote:

> +  					HUGETLB_PAGE_ORDER);
> +
> +		nid = next_node(nid, node_online_map);
> +		if (nid == MAX_NUMNODES)
> +			nid = first_node(node_online_map);

Maybe use nr_node_ids here? May save some scanning over online maps?

>   * int node_possible(node)		Is some node possible?
> + * int node_populated(node)		Is some node populated [at 'HIGHUSER]
>   *

Good.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 16:57       ` Christoph Lameter
@ 2007-05-09 19:17         ` Lee Schermerhorn
  2007-05-16 17:27           ` Nish Aravamudan
  0 siblings, 1 reply; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-09 19:17 UTC (permalink / raw)
  To: Christoph Lameter, Anton Blanchard
  Cc: linux-mm, ak, nish.aravamudan, mel, apw, Andrew Morton, Eric Whitney

On Wed, 2007-05-09 at 09:57 -0700, Christoph Lameter wrote:
> On Wed, 9 May 2007, Lee Schermerhorn wrote:
> 
> > +  					HUGETLB_PAGE_ORDER);
> > +
> > +		nid = next_node(nid, node_online_map);
> > +		if (nid == MAX_NUMNODES)
> > +			nid = first_node(node_online_map);
> 
> Maybe use nr_node_ids here? May save some scanning over online maps?

Good idea.  I won't get to it until next week.  Maybe we'll have more
comments by then.

Anton:  anything to add?  

> 
> >   * int node_possible(node)		Is some node possible?
> > + * int node_populated(node)		Is some node populated [at 'HIGHUSER]
> >   *
> 
> Good.

Thanks.

Later,
Lee


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 16:37     ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3 Lee Schermerhorn
  2007-05-09 16:57       ` Christoph Lameter
@ 2007-05-09 19:59       ` Nish Aravamudan
  2007-05-09 20:37         ` Lee Schermerhorn
  2007-05-16 19:59       ` Nish Aravamudan
  2 siblings, 1 reply; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-09 19:59 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, William Lee Irwin III

[Adding wli to the Cc]

On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Fri, 2007-05-04 at 14:27 -0700, Christoph Lameter wrote:
> > On Fri, 4 May 2007, Lee Schermerhorn wrote:
> >
> > > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > > An interesting bug was pointed out to me where we failed to allocate
> > > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > > CPUs). Node 0 and 1 have plenty of free memory. After doing:
> > >
> > > Here's my attempt to fix the problem [I see it on HP platforms as well],
> > > without removing the population check in build_zonelists_node().  Seems
> > > to work.
> >
> > I think we need something like for_each_online_node for each node with
> > memory otherwise we are going to replicate this all over the place for
> > memoryless nodes. Add a nodemap for populated nodes?
> >
> > I.e.
> >
> > for_each_mem_node?
> >
> > Then you do not have to check the zone flags all the time. May avoid a lot
> > of mess?
>
> OK, here's a rework that exports a node_populated_map and associated
> access functions from page_alloc.c where we already check for populated
> zones.  Maybe this should be "node_hugepages_map" ?
>
> Also, we might consider exporting this to user space for applications
> that want to "interleave across all nodes with hugepages"--not that
> hugetlbfs mappings currently obey "vma policy".  Could still be used
> with the "set task policy before allocating region" method [not that I
> advocate this method ;-)].

For libhugetlbfs purposes, with 1.1 and later, we've recommended folks
use numactl in coordination with the library to specify the policy.
After a kernel fix that was submitted a while back (and has been merged
for at least a few releases), hugepages interleave properly when
requested.
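
Typically something along these lines, where the exact environment
variables depend on the libhugetlbfs setup and ./app is just a stand-in
for the real binary:

	HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so \
		numactl --interleave=all ./app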

> I don't think that a 'for_each_*_node()' macro is appropriate for this
> usage, as allocate_fresh_huge_page() is an "incremental allocator" that
> returns a page from the "next eligible node" on each call.
>
> By the way:  does anything protect the "static int nid" in
> allocate_fresh_huge_page() from racing attempts to set nr_hugepages?
> Can this happen?  Do we care?

Hrm, not sure if we care or not.

We've got a draft of a patch that exports nr_hugepages on a per-node
basis in sysfs. Will post it soon, as an additional, more flexible
interface for dealing with hugepages on NUMA.

<snip>

> -       page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> -                                       HUGETLB_PAGE_ORDER);

<snip>

> +                       page = alloc_pages_node(nid,
> +                                       GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> +                                       HUGETLB_PAGE_ORDER);

Are we taking out the GFP_NOWARN for a reason? I noticed this in
Anton's patch, but forgot to ask.

<snip>

> Index: Linux/include/linux/nodemask.h

<snip>

> + * node_set_poplated(node)             set bit 'node' in node_populated_map
> + * node_not_poplated(node)             clear bit 'node' in node_populated_map

typos? (poplated v. populated)

<snip>

Looks reasonable otherwise.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 19:59       ` Nish Aravamudan
@ 2007-05-09 20:37         ` Lee Schermerhorn
  2007-05-09 20:54           ` Christoph Lameter
  2007-05-09 22:34           ` Nish Aravamudan
  0 siblings, 2 replies; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-09 20:37 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, William Lee Irwin III

On Wed, 2007-05-09 at 12:59 -0700, Nish Aravamudan wrote:
> [Adding wli to the Cc]
> 
> On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Fri, 2007-05-04 at 14:27 -0700, Christoph Lameter wrote:
> > > On Fri, 4 May 2007, Lee Schermerhorn wrote:
> > >
> > > > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > > > An interesting bug was pointed out to me where we failed to allocate
> > > > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > > > CPUs). Node 0 and 1 have plenty of free memory. After doing:

<snip>

> >
> > OK, here's a rework that exports a node_populated_map and associated
> > access functions from page_alloc.c where we already check for populated
> > zones.  Maybe this should be "node_hugepages_map" ?
> >
> > Also, we might consider exporting this to user space for applications
> > that want to "interleave across all nodes with hugepages"--not that
> > hugetlbfs mappings currently obey "vma policy".  Could still be used
> > with the "set task policy before allocating region" method [not that I
> > advocate this method ;-)].
> 
> For libhugetlbfs purposes, with 1.1 and later, we've recommended folks
> use numactl in coordination with the library to specify the policy.
> After a kernel fix that submitted a while back (and has been merged
> for at least a few releases), hugepages interleave properly when
> requested.

You mean using the numactl command to preposition a hugetlb shmem seg
external to the application?  That's one way to do it.  Some apps like
to handle this internally themselves.  

<snip>
> >
> > By the way:  does anything protect the "static int nid" in
> > allocate_fresh_huge_page() from racing attempts to set nr_hugepages?
> > Can this happen?  Do we care?
> 
> Hrm, not sure if we care or not.

Shouldn't happen too often, I think.  And the only result should be some
additional imbalance that this patch is trying to address.  Still, I
don't know that it's worth another lock.  And, I don't think we want to
hold the hugetlb_lock over the page allocation.  However, with a slight
reordering of the code and maybe an additional temporary nid variable,
we could grab the hugetlb lock while updating the static nid each time
around the loop.  I don't see this as a performance path, but again, is
it worth it?
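
Something like this, perhaps [completely untested sketch of that
reordering; node_populated() is from the V3 patch]:

	static int nid = 0;
	struct page *page = NULL;
	int start_nid, try_nid;

	spin_lock(&hugetlb_lock);
	start_nid = nid;
	spin_unlock(&hugetlb_lock);

	do {
		spin_lock(&hugetlb_lock);
		try_nid = nid;		/* node to try on this pass */
		nid = next_node(nid, node_online_map);
		if (nid == MAX_NUMNODES)
			nid = first_node(node_online_map);
		spin_unlock(&hugetlb_lock);

		if (node_populated(try_nid))
			page = alloc_pages_node(try_nid,
					GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
					HUGETLB_PAGE_ORDER);
	} while (!page && nid != start_nid);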
 
> 
> We've got a draft of patch that exports nr_hugepages on a per-node
> basis in sysfs. Will post it soon, as an additional, more flexible
> interface for dealing with hugepages on NUMA.

You mean the ability to specify explicitly the number of hugepages per
node?  

> 
> <snip>
> 
> > -       page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > -                                       HUGETLB_PAGE_ORDER);
> 
> <snip>
> 
> > +                       page = alloc_pages_node(nid,
> > +                                       GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> > +                                       HUGETLB_PAGE_ORDER);
> 
> Are we taking out the GFP_NOWARN for a reason? I noticed this in
> Anton's patch, but forgot to ask.

Actually, I hadn't noticed, but a quick look shows that GFP_THISNODE
contains the __GFP_NOWARN flag, as well as __GFP_NORETRY, which I think
is OK/desirable.
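
[For reference, under CONFIG_NUMA gfp.h has, roughly:

	#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

so dropping the explicit __GFP_NOWARN shouldn't change behaviour.]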

> 
> <snip>
> 
> > Index: Linux/include/linux/nodemask.h
> 
> <snip>
> 
> > + * node_set_poplated(node)             set bit 'node' in node_populated_map
> > + * node_not_poplated(node)             clear bit 'node' in node_populated_map
> 
> typos? (poplated v. populated)

Drat!  will fix on repost--next week, unless someone beats me to it.
> 
> <snip>
> 
> Looks reasonable otherwise.
> 
> Thanks,
> Nish


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 20:37         ` Lee Schermerhorn
@ 2007-05-09 20:54           ` Christoph Lameter
  2007-05-09 22:34           ` Nish Aravamudan
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2007-05-09 20:54 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nish Aravamudan, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, William Lee Irwin III

On Wed, 9 May 2007, Lee Schermerhorn wrote:

> > 
> > > +                       page = alloc_pages_node(nid,
> > > +                                       GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> > > +                                       HUGETLB_PAGE_ORDER);
> > 
> > Are we taking out the GFP_NOWARN for a reason? I noticed this in
> > Anton's patch, but forgot to ask.
> 
> Actually, I hadn't noticed, but a quick look shows that GFP_THISNODE
> contains the __GFP_NOWARN flag, as well as '_NORETRY which I think is
> OK/desirable.

It is required because GFP_THISNODE needs to fail if it cannot get memory 
from the right node.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 20:37         ` Lee Schermerhorn
  2007-05-09 20:54           ` Christoph Lameter
@ 2007-05-09 22:34           ` Nish Aravamudan
  2007-05-15 16:30             ` Lee Schermerhorn
  1 sibling, 1 reply; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-09 22:34 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, William Lee Irwin III

On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-05-09 at 12:59 -0700, Nish Aravamudan wrote:
> > [Adding wli to the Cc]
> >
> > On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > On Fri, 2007-05-04 at 14:27 -0700, Christoph Lameter wrote:
> > > > On Fri, 4 May 2007, Lee Schermerhorn wrote:
> > > >
> > > > > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > > > > An interesting bug was pointed out to me where we failed to allocate
> > > > > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > > > > CPUs). Node 0 and 1 have plenty of free memory. After doing:
>
> <snip>
>
> > >
> > > OK, here's a rework that exports a node_populated_map and associated
> > > access functions from page_alloc.c where we already check for populated
> > > zones.  Maybe this should be "node_hugepages_map" ?
> > >
> > > Also, we might consider exporting this to user space for applications
> > > that want to "interleave across all nodes with hugepages"--not that
> > > hugetlbfs mappings currently obey "vma policy".  Could still be used
> > > with the "set task policy before allocating region" method [not that I
> > > advocate this method ;-)].

Hrm, I forgot to reply to that bit before. I think hugetlbfs mappings
will obey at least the interleave policy, per:

struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
{
        struct mempolicy *pol = get_vma_policy(current, vma, addr);

        if (pol->policy == MPOL_INTERLEAVE) {
                unsigned nid;

                nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
                return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
        }
        return zonelist_policy(GFP_HIGHUSER, pol);
}

> > For libhugetlbfs purposes, with 1.1 and later, we've recommended folks
> > use numactl in coordination with the library to specify the policy.
> > After a kernel fix that submitted a while back (and has been merged
> > for at least a few releases), hugepages interleave properly when
> > requested.
>
> You mean using numactl command to preposition a hugetlb shmem seg
> external to the application?  That's one way to do it.  Some apps like
> to handle this internally themselves.

Hrm, well, libhugetlbfs does not do anything with shmem segs -- I was
referring to using numactl to set the policy for hugepages, which our
malloc implementation then takes advantage of. Hugepages show up
interleaved across the nodes. This was what required a simple
mempolicy fix last August (3b98b087fc2daab67518d2baa8aef19a6ad82723).

> > > By the way:  does anything protect the "static int nid" in
> > > allocate_fresh_huge_page() from racing attempts to set nr_hugepages?
> > > Can this happen?  Do we care?
> >
> > Hrm, not sure if we care or not.
>
> Shouldn't happen too often, I think.  And the only result should be some
> additional imbalance that this patch is trying to address.  Still, I
> don't know that it's worth another lock.  And, I don't think we want to
> hold the hugetlb_lock over the page allocation.  However, with a slight
> reordering of the code, with maybe an additional temporary nid variable,
> we could grab the hugetlb lock while updating the static nid each time
> around the loop.  I don't see this as a performance path, but again, is
> it worth it?

I would say it's not, but will defer to wli et al.

> > We've got a draft of patch that exports nr_hugepages on a per-node
> > basis in sysfs. Will post it soon, as an additional, more flexible
> > interface for dealing with hugepages on NUMA.
>
> You mean the ability to specify explicitly the number of hugepages per
> node?

Yep, seems like a handy feature.

> > <snip>
> >
> > > -       page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > -                                       HUGETLB_PAGE_ORDER);
> >
> > <snip>
> >
> > > +                       page = alloc_pages_node(nid,
> > > +                                       GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> > > +                                       HUGETLB_PAGE_ORDER);
> >
> > Are we taking out the GFP_NOWARN for a reason? I noticed this in
> > Anton's patch, but forgot to ask.
>
> Actually, I hadn't noticed, but a quick look shows that GFP_THISNODE
> contains the __GFP_NOWARN flag, as well as '_NORETRY which I think is
> OK/desirable.

Good call, sorry for the noise. This makes sense (along with Christoph's reply).

Thanks,
Nish


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 22:34           ` Nish Aravamudan
@ 2007-05-15 16:30             ` Lee Schermerhorn
  2007-05-16 23:47               ` Nish Aravamudan
  0 siblings, 1 reply; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-15 16:30 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, William Lee Irwin III

On Wed, 2007-05-09 at 15:34 -0700, Nish Aravamudan wrote: 
> On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Wed, 2007-05-09 at 12:59 -0700, Nish Aravamudan wrote:
> > > [Adding wli to the Cc]

<snip>

> > > >
> > > > OK, here's a rework that exports a node_populated_map and associated
> > > > access functions from page_alloc.c where we already check for populated
> > > > zones.  Maybe this should be "node_hugepages_map" ?
> > > >
> > > > Also, we might consider exporting this to user space for applications
> > > > that want to "interleave across all nodes with hugepages"--not that
> > > > hugetlbfs mappings currently obey "vma policy".  Could still be used
> > > > with the "set task policy before allocating region" method [not that I
> > > > advocate this method ;-)].
> 
> Hrm, I forgot to reply to that bit before. I think hugetlbfs mappings
> will obey at least the interleave policy, per:
> 
> struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
> {
>         struct mempolicy *pol = get_vma_policy(current, vma, addr);
> 
>         if (pol->policy == MPOL_INTERLEAVE) {
>                 unsigned nid;
> 
>                 nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
>                 return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
>         }
>         return zonelist_policy(GFP_HIGHUSER, pol);
> }


I should have been more specific.  I've only used hugetlbfs with shmem
segments.  I haven't tried libhugetlbfs [does it work on ia64, yet?].
Huge page shmem segments don't set/get the shared policy on the shared
object itself.  Rather they use the task private vma_policy.

Huge tlb shared segments would obey shared policy if the hugetlb_vm_ops
were hooked up to the shared policy mechanism, but they are not.  So,
there is no way to set the shared policy on such segments.  I tried a
simple patch to add the shmem_{get|set}_policy() functions to the
hugetlb_vm_ops, and this almost works.  However, a cat
of /proc/<pid>/numa_maps hangs.  Haven't investigated the hang, yet.
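
The experiment was basically along these lines [sketch only;
shmem_{get|set}_policy() would also have to be made visible outside
mm/shmem.c]:

	struct vm_operations_struct hugetlb_vm_ops = {
		/* ... existing entries unchanged ... */
	#ifdef CONFIG_NUMA
		.set_policy	= shmem_set_policy,
		.get_policy	= shmem_get_policy,
	#endif
	};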

Anyway, here's the scenario to illustrate the "failure"--e.g., on
2.6.22-rc1:

From one task, create a huge page shmem segment:  shmget with
SHM_HUGETLB

map it [shmat] 

Now, either fork a child process or attach the shmem segment from
another task.

From one task, mbind the segment with, say, interleave policy, but don't
touch it to fault in pages yet.  [If you use a separate task, rather
than a child of the first task, you can do the mbind before the attach.
However, a child will inherit the address space of the parent, so any
mbind before a fork will be inherited by the child.  A workaround, at
least.]

Touch the addresses from the child/other task to fault in the pages.
Unlike normal, non-huge, shmem segments, the huge pages faulted in by a
task which did not set/inherit the policy will not obey the policy.

Here's an "memtoy" script that will do this.  Requires at least
memtoy-0.11a/b [see http://free.linux.hp.com/~lts/Tools].  Assumes at
least 4G worth of huge pages distributed across 4 nodes.  Adjust script
according to your config.

shmem foo 4g huge
map foo
child x
mbind foo interleave 0,1,2,3
# touch foo from the child 'x'
/x touch foo
# what does parent see?
where foo
# what does child see?
/x where foo

Run this script using:  'memtoy -v <path-to-script>'

> 
> > > For libhugetlbfs purposes, with 1.1 and later, we've recommended folks
> > > use numactl in coordination with the library to specify the policy.
> > > After a kernel fix that submitted a while back (and has been merged
> > > for at least a few releases), hugepages interleave properly when
> > > requested.
> >
> > You mean using numactl command to preposition a hugetlb shmem seg
> > external to the application?  That's one way to do it.  Some apps like
> > to handle this internally themselves.
> 
> Hrm, well, libhugetlbfs does not do anything with shmem segs -- I was
> referring to use numactl to set the policy for hugepages which then
> our malloc implementation takes advantage of. Hugepages show up
> interleaved across the nodes. This was what required a simple
> mempolicy fix last August (3b98b087fc2daab67518d2baa8aef19a6ad82723)
> 

Yep.  Saw that one go by.  Interleaving DOES work for hugetlb shmem
segments if, e.g., in the script above, you touch the segment from the
parent where the mbind was done, or if you do the mbind before forking
the child.

Lee



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 19:17         ` Lee Schermerhorn
@ 2007-05-16 17:27           ` Nish Aravamudan
  2007-05-16 20:01             ` Lee Schermerhorn
  0 siblings, 1 reply; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-16 17:27 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney

On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-05-09 at 09:57 -0700, Christoph Lameter wrote:
> > On Wed, 9 May 2007, Lee Schermerhorn wrote:
> >
> > > +                                   HUGETLB_PAGE_ORDER);
> > > +
> > > +           nid = next_node(nid, node_online_map);
> > > +           if (nid == MAX_NUMNODES)
> > > +                   nid = first_node(node_online_map);
> >
> > Maybe use nr_node_ids here? May save some scanning over online maps?
>
> Good idea.  I won't get to it until next week.  Maybe we'll have more
> comments by then.
>
> Anton:  anything to add?

Actually, I was going to ask: why don't we just iterate over
node_populated_map, rather than going over some other map and then
checking to see if the node is populated every time?  You've exported
it and everything...
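
I.e. just

	nid = next_node(nid, node_populated_map);
	if (nid == MAX_NUMNODES)
		nid = first_node(node_populated_map);

and drop the node_populated() check from the loop body entirely?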

Thanks,
Nish


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-09 16:37     ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3 Lee Schermerhorn
  2007-05-09 16:57       ` Christoph Lameter
  2007-05-09 19:59       ` Nish Aravamudan
@ 2007-05-16 19:59       ` Nish Aravamudan
  2007-05-16 20:32         ` Lee Schermerhorn
  2007-05-16 22:17         ` [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4 Lee Schermerhorn
  2 siblings, 2 replies; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-16 19:59 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, andyw

On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Fri, 2007-05-04 at 14:27 -0700, Christoph Lameter wrote:
> > On Fri, 4 May 2007, Lee Schermerhorn wrote:
> >
> > > On Wed, 2007-05-02 at 21:21 -0500, Anton Blanchard wrote:
> > > > An interesting bug was pointed out to me where we failed to allocate
> > > > hugepages evenly. In the example below node 7 has no memory (it only has
> > > > CPUs). Node 0 and 1 have plenty of free memory. After doing:
> > >
> > > Here's my attempt to fix the problem [I see it on HP platforms as well],
> > > without removing the population check in build_zonelists_node().  Seems
> > > to work.
> >
> > I think we need something like for_each_online_node for each node with
> > memory otherwise we are going to replicate this all over the place for
> > memoryless nodes. Add a nodemap for populated nodes?
> >
> > I.e.
> >
> > for_each_mem_node?
> >
> > Then you do not have to check the zone flags all the time. May avoid a lot
> > of mess?
>
> OK, here's a rework that exports a node_populated_map and associated
> access functions from page_alloc.c where we already check for populated
> zones.  Maybe this should be "node_hugepages_map" ?
>
> Also, we might consider exporting this to user space for applications
> that want to "interleave across all nodes with hugepages"--not that
> hugetlbfs mappings currently obey "vma policy".  Could still be used
> with the "set task policy before allocating region" method [not that I
> advocate this method ;-)].
>
> I don't think that a 'for_each_*_node()' macro is appropriate for this
> usage, as alloc_fresh_huge_page() is an "incremental allocator" that
> returns a page from the "next eligible node" on each call.
>
> By the way:  does anything protect the "static int nid" in
> alloc_fresh_huge_page() from racing attempts to set nr_hugepages?
> Can this happen?  Do we care?
>
> Again, I chose to rework Anton's original patch, maintaining his
> rationale/discussion, rather than create a separate patch.  Note the "Rework"
> comments therein--especially regarding NORMAL zone.  I expect we'll need
> a few more rounds of "discussion" on this issue.  And, it'll require
> rework to merge with the "change zonelist order" series that hits the
> same area.
>
> Lee
>
> [PATCH] Fix hugetlb pool allocation with empty nodes - V3

<snip>

===================================================================
> --- Linux.orig/mm/page_alloc.c  2007-05-08 11:47:45.000000000 -0400
> +++ Linux/mm/page_alloc.c       2007-05-09 11:16:27.000000000 -0400

<snip>

> @@ -2021,11 +2024,14 @@ void show_free_areas(void)
>   * Builds allocation fallback zone lists.
>   *
>   * Add all populated zones of a node to the zonelist.
> + * Record nodes with populated gfp_zone(GFP_HIGHUSER) for huge page allocation.
>   */
>  static int __meminit build_zonelists_node(pg_data_t *pgdat,
> -                       struct zonelist *zonelist, int nr_zones, enum zone_type zone_type)
> +                       struct zonelist *zonelist, int nr_zones,
> +                       enum zone_type zone_type)
>  {
>         struct zone *zone;
> +       enum zone_type zone_highuser = gfp_zone(GFP_HIGHUSER);
>
>         BUG_ON(zone_type >= MAX_NR_ZONES);
>         zone_type++;
> @@ -2036,7 +2042,10 @@ static int __meminit build_zonelists_nod
>                 if (populated_zone(zone)) {
>                         zonelist->zones[nr_zones++] = zone;
>                         check_highest_zone(zone_type);
> -               }
> +                       if (zone_type == zone_highuser)
> +                               node_set_populated(pgdat->node_id);
> +               } else if (zone_type == zone_highuser)
> +                       node_not_populated(pgdat->node_id);
>
>         } while (zone_type);
>         return nr_zones;

This completely breaks hugepage allocation on a 4-node x86_64 box I have
here. Each node has <4GB of memory, so all memory is in ZONE_DMA and
ZONE_DMA32. gfp_zone(GFP_HIGHUSER) is ZONE_NORMAL, though, so no node
ever gets marked populated and the nodemask keeps its default empty
initialization.
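
To spell out the failure mode, an illustrative fragment only, reusing
the V3 names (not patch code):

	/*
	 * On this box every node looks like:
	 *   node_zones[ZONE_DMA]    populated
	 *   node_zones[ZONE_DMA32]  populated
	 *   node_zones[ZONE_NORMAL] empty (all memory is below 4GB)
	 *
	 * so the V3 check never fires on any node:
	 */
	enum zone_type zone_highuser = gfp_zone(GFP_HIGHUSER);	/* ZONE_NORMAL here */
	int nid;

	for_each_online_node(nid) {
		struct zone *z = NODE_DATA(nid)->node_zones + zone_highuser;

		if (populated_zone(z))			/* false for every node */
			node_set_populated(nid);	/* never reached */
	}
	/* node_populated_map is still NODE_MASK_NONE, so
	 * alloc_fresh_huge_page() has no node it is allowed to try */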

Thanks to Andy Whitcroft for helping me debug this.

I'm not sure how to fix this -- but I ran into it while trying to base my
sysfs hugepage allocation patches on top of yours.

Thoughts?

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-16 17:27           ` Nish Aravamudan
@ 2007-05-16 20:01             ` Lee Schermerhorn
  0 siblings, 0 replies; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-16 20:01 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney

On Wed, 2007-05-16 at 10:27 -0700, Nish Aravamudan wrote:
> On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Wed, 2007-05-09 at 09:57 -0700, Christoph Lameter wrote:
> > > On Wed, 9 May 2007, Lee Schermerhorn wrote:
> > >
> > > > +                                   HUGETLB_PAGE_ORDER);
> > > > +
> > > > +           nid = next_node(nid, node_online_map);
> > > > +           if (nid == MAX_NUMNODES)
> > > > +                   nid = first_node(node_online_map);
> > >
> > > Maybe use nr_node_ids here? May save some scanning over online maps?
> >
> > Good idea.  I won't get to it until next week.  Maybe we'll have more
> > comments by then.
> >
> > Anton:  anything to add?
> 
> Actually, I was going to ask: why don't we just iterate over
> node_populated_map? You've exported it and everything, rather than
> going over some other map and then checking whether the node is
> populated every time.
> 

Uh... tunnel vision?

I'm testing a reworked patch against 2.6.22-rc1-mm1.  Will post
shortly, I hope.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-16 19:59       ` Nish Aravamudan
@ 2007-05-16 20:32         ` Lee Schermerhorn
  2007-05-16 22:17         ` [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4 Lee Schermerhorn
  1 sibling, 0 replies; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-16 20:32 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, andyw

On Wed, 2007-05-16 at 12:59 -0700, Nish Aravamudan wrote:


> 
> This completely breaks hugepage allocation on a 4-node x86_64 box I have
> here. Each node has <4GB of memory, so all memory is in ZONE_DMA and
> ZONE_DMA32. gfp_zone(GFP_HIGHUSER) is ZONE_NORMAL, though, so no node
> ever gets marked populated and the nodemask keeps its default empty
> initialization.
> 
> Thanks to Andy Whitcroft for helping me debug this.
> 
> I'm not sure how to fix this -- but I ran into it while trying to base my
> sysfs hugepage allocation patches on top of yours.
> 
> Thoughts?

I'm looking at this now on a 2 socket x86_64 blade with 2GB/node.  Just
noticed that I'm not seeing any nodes in the node_populated_map.
Haven't figured out how to fix it yet.  Stand by...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4
  2007-05-16 19:59       ` Nish Aravamudan
  2007-05-16 20:32         ` Lee Schermerhorn
@ 2007-05-16 22:17         ` Lee Schermerhorn
  2007-05-18  0:30           ` Nish Aravamudan
  1 sibling, 1 reply; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-16 22:17 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, andyw

On Wed, 2007-05-16 at 12:59 -0700, Nish Aravamudan wrote:

<snip>
> 
> This completely breaks hugepage allocation on a 4-node x86_64 box I have
> here. Each node has <4GB of memory, so all memory is in ZONE_DMA and
> ZONE_DMA32. gfp_zone(GFP_HIGHUSER) is ZONE_NORMAL, though, so no node
> ever gets marked populated and the nodemask keeps its default empty
> initialization.
> 
> Thanks to Andy Whitcroft for helping me debug this.
> 
> I'm not sure how to fix this -- but I ran into it while trying to base my
> sysfs hugepage allocation patches on top of yours.

OK.  Try this.  Tested OK on 4 node [+ 1 pseudo node] ia64 and 2 node
x86_64.  The x86_64 had 2G per node--all DMA32.

Notes:

1) applies on 2.6.22-rc1-mm1 atop my earlier patch to add the call to
check_highest_zone() to build_zonelists_in_zone_order().  I think it'll
apply [with offsets] w/o that patch.

2) non-NUMA case not tested.

3) note the redefinition of the populated map and the possible side
effects with some non-symmetric node memory configurations.

4) this is an RFC.  At a minimum, I'll need to repost with a cleaner
patch description, I think.  I.e., minus these notes...

5) I've retained Anton's original sign off.  Hope that's OK...

Lee

[PATCH 2.6.22-rc1-mm1] Fix hugetlb pool allocation with empty nodes V4

Original Patch [V1]:

Date:	Wed, 2 May 2007 21:21:07 -0500
From: Anton Blanchard <anton@samba.org>
To: linux-mm@kvack.org, clameter@SGI.com, ak@suse.de
Cc: nish.aravamudan@gmail.com, mel@csn.ul.ie, apw@shadowen.org
Subject: [PATCH] Fix hugetlb pool allocation with empty nodes

An interesting bug was pointed out to me where we failed to allocate
hugepages evenly. In the example below node 7 has no memory (it only has
CPUs). Node 0 and 1 have plenty of free memory. After doing:

# echo 16 > /proc/sys/vm/nr_hugepages

We see the imbalance:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     6
Node 1 HugePages_Total:     10
Node 7 HugePages_Total:     0

It didn't take long to realise that alloc_fresh_huge_page is allocating
from node 7 without GFP_THISNODE set, so we fall back to its next
preferred node (ie 1). This means we end up with a 1/3 2/3 imbalance.

After fixing this it still didn't work, and after some more poking I see
why. When building our fallback zonelist in build_zonelists_node we
skip empty zones. This means zone 7 never registers node 7's empty
zonelists and instead registers node 1's. Therefore when we ask for a
page from node 7, using the GFP_THISNODE flag we end up with node 1
memory.

By removing the populated_zone() check in build_zonelists_node we fix
the problem:

# cat /sys/devices/system/node/node*/meminfo|grep HugePages_Total
Node 0 HugePages_Total:     8
Node 1 HugePages_Total:     8
Node 7 HugePages_Total:     0

I'm guessing registering empty remote zones might make the SGI guys a bit
unhappy, maybe we should just force the registration of empty local
zones? Does anyone care?

Rework [should have been V3]

Create node_populated_map and access functions [nodemask.h] to describe
nodes with populated gfp_zone(GFP_HIGHUSER).  Note that on x86, this
excludes nodes with only DMA* or NORMAL memory--i.e., no hugepages below
4G.  

Populate the map while building zonelists, where we already check for
populated zones.  This is early enough for boot time allocation of
huge pages.

Attempt to allocate "fresh" huge pages only from nodes in the populated
map.

Tested on ia64 numa platform with both boot time and run time allocation
of huge pages.

Rework "V4":

+ rebase to 22-rc1-mm1 with "change zonelist order" series:

+ redefine node_populated_map to contain nodes whose first zone in the
  'policy_zone' zonelist is "on node".  This will be used to filter nodes
  for hugepage allocation.  Note:  if some node has only DMA32, but
  policy_zone is > DMA32 [some other node/s has/have memory in higher
  zones] AND we're building the zonelists in zone order, we won't mark
  this node as populated.  No hugepages will be allocated from such a
  node.

+ fix typos in comments per Nish Aravamudan.

+ rework alloc_fresh_huge_page() to just scan the populated map,
  again Nish's suggestion.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Anton Blanchard <anton@samba.org>

 include/linux/nodemask.h |   10 ++++++++++
 mm/hugetlb.c             |   21 +++++++++++++++------
 mm/page_alloc.c          |   33 +++++++++++++++++++++++++++++++--
 3 files changed, 56 insertions(+), 8 deletions(-)

Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c	2007-05-16 17:45:38.000000000 -0400
+++ Linux/mm/hugetlb.c	2007-05-16 17:47:54.000000000 -0400
@@ -105,13 +105,22 @@ static void free_huge_page(struct page *
 
 static int alloc_fresh_huge_page(void)
 {
-	static int nid = 0;
+	static int nid = -1;
 	struct page *page;
-	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+	int start_nid;
+
+	if (nid < 0)
+		nid = first_node(node_populated_map);
+	start_nid = nid;
+
+	do {
+		page = alloc_pages_node(nid,
+				GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
+  				HUGETLB_PAGE_ORDER);
+		nid = next_node(nid, node_populated_map);
+		if (nid >= nr_node_ids)
+			nid = first_node(node_populated_map);
+	} while (!page && nid != start_nid);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-05-16 17:45:38.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-05-16 17:47:54.000000000 -0400
@@ -64,12 +64,16 @@
  *
  * int node_online(node)		Is some node online?
  * int node_possible(node)		Is some node possible?
+ * int node_populated(node)		Is some node populated [at policy_zone]
  *
  * int any_online_node(mask)		First online node in mask
  *
  * node_set_online(node)		set bit 'node' in node_online_map
  * node_set_offline(node)		clear bit 'node' in node_online_map
  *
+ * node_set_populated(node)		set bit 'node' in node_populated_map
+ * node_not_populated(node)		clear bit 'node' in node_populated_map
+ *
  * for_each_node(node)			for-loop node over node_possible_map
  * for_each_online_node(node)		for-loop node over node_online_map
  *
@@ -344,12 +348,14 @@ static inline void __nodes_remap(nodemas
 
 extern nodemask_t node_online_map;
 extern nodemask_t node_possible_map;
+extern nodemask_t node_populated_map;
 
 #if MAX_NUMNODES > 1
 #define num_online_nodes()	nodes_weight(node_online_map)
 #define num_possible_nodes()	nodes_weight(node_possible_map)
 #define node_online(node)	node_isset((node), node_online_map)
 #define node_possible(node)	node_isset((node), node_possible_map)
+#define node_populated(node)	node_isset((node), node_populated_map)
 #define first_online_node	first_node(node_online_map)
 #define next_online_node(nid)	next_node((nid), node_online_map)
 extern int nr_node_ids;
@@ -358,6 +364,7 @@ extern int nr_node_ids;
 #define num_possible_nodes()	1
 #define node_online(node)	((node) == 0)
 #define node_possible(node)	((node) == 0)
+#define node_populated(node)	((node) == 0)
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
@@ -375,6 +382,9 @@ extern int nr_node_ids;
 #define node_set_online(node)	   set_bit((node), node_online_map.bits)
 #define node_set_offline(node)	   clear_bit((node), node_online_map.bits)
 
+#define node_set_populated(node)   set_bit((node), node_populated_map.bits)
+#define node_not_populated(node)   clear_bit((node), node_populated_map.bits)
+
 #define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
 #define for_each_online_node(node) for_each_node_mask((node), node_online_map)
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-05-16 17:47:53.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-05-16 17:47:54.000000000 -0400
@@ -54,6 +54,9 @@ nodemask_t node_online_map __read_mostly
 EXPORT_SYMBOL(node_online_map);
 nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
 EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_populated_map __read_mostly = NODE_MASK_NONE;
+EXPORT_SYMBOL(node_populated_map);
+
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
@@ -2203,7 +2206,7 @@ static int node_order[MAX_NUMNODES];
 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
 	enum zone_type i;
-	int pos, j, node;
+	int pos, j;
 	int zone_type;		/* needs to be signed */
 	struct zone *z;
 	struct zonelist *zonelist;
@@ -2213,7 +2216,7 @@ static void build_zonelists_in_zone_orde
 		pos = 0;
 		for (zone_type = i; zone_type >= 0; zone_type--) {
 			for (j = 0; j < nr_nodes; j++) {
-				node = node_order[j];
+				int node = node_order[j];
 				z = &NODE_DATA(node)->node_zones[zone_type];
 				if (populated_zone(z)) {
 					zonelist->zones[pos++] = z;
@@ -2286,6 +2289,22 @@ static void set_zonelist_order(void)
 		current_zonelist_order = user_zonelist_order;
 }
 
+/*
+ * setup_populated_map() - record nodes whose "policy_zone" is "on-node".
+ */
+static void setup_populated_map(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	struct zonelist *zl = pgdat->node_zonelists + policy_zone;
+	struct zone *z = zl->zones[0];
+
+	VM_BUG_ON(!z);
+	if (z->zone_pgdat == pgdat)
+		node_set_populated(nid);
+	else
+		node_not_populated(nid);
+}
+
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int j, node, load;
@@ -2369,6 +2388,15 @@ static void set_zonelist_order(void)
 	current_zonelist_order = ZONELIST_ORDER_ZONE;
 }
 
+/*
+ * setup_populated_map - non-NUMA case
+ * Only node 0 should be on-line, and it MUST be populated!
+ */
+static void setup_populated_map(int nid)
+{
+	node_set_populated(nid);
+}
+
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -2423,6 +2451,7 @@ static int __build_all_zonelists(void *d
 	for_each_online_node(nid) {
 		build_zonelists(NODE_DATA(nid));
 		build_zonelist_cache(NODE_DATA(nid));
+		setup_populated_map(nid);
 	}
 	return 0;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3
  2007-05-15 16:30             ` Lee Schermerhorn
@ 2007-05-16 23:47               ` Nish Aravamudan
  0 siblings, 0 replies; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-16 23:47 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, William Lee Irwin III

On 5/15/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-05-09 at 15:34 -0700, Nish Aravamudan wrote:
> > On 5/9/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > On Wed, 2007-05-09 at 12:59 -0700, Nish Aravamudan wrote:
> > > > [Adding wli to the Cc]
>
> <snip>
>
> > > > >
> > > > > OK, here's a rework that exports a node_populated_map and associated
> > > > > access functions from page_alloc.c where we already check for populated
> > > > > zones.  Maybe this should be "node_hugepages_map" ?
> > > > >
> > > > > Also, we might consider exporting this to user space for applications
> > > > > that want to "interleave across all nodes with hugepages"--not that
> > > > > hugetlbfs mappings currently obey "vma policy".  Could still be used
> > > > > with the "set task policy before allocating region" method [not that I
> > > > > advocate this method ;-)].
> >
> > Hrm, I forgot to reply to that bit before. I think hugetlbfs mappings
> > will obey at least the interleave policy, per:
> >
> > struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
> > {
> >         struct mempolicy *pol = get_vma_policy(current, vma, addr);
> >
> >         if (pol->policy == MPOL_INTERLEAVE) {
> >                 unsigned nid;
> >
> >                 nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> >                 return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
> >         }
> >         return zonelist_policy(GFP_HIGHUSER, pol);
> > }
>
>
> I should have been more specific.  I've only used hugetlbfs with shmem
> segments.  I haven't tried libhugetlbfs [does it work on ia64, yet?].
> Huge page shmem segments don't set/get the shared policy on the shared
> object itself.  Rather, they use the task-private vma_policy.

Ah ok. FWIW, libhugetlbfs does have pseudo-support for IA64. At least,
we build and can run the general functionality tests (those that make
sense on IA64). We avoid tests that specify MAP_FIXED, for instance.
We don't currently have plans to support segment remapping on IA64,
because it's a bit tougher than on other architectures due to hugepage
location restrictions. Hugepage malloc should work, though. And since
we only use mlock() in the malloc code, that's where you'd want to use
numactl anyways to specify the right policy.

> Huge tlb shared segments would obey shared policy if the hugetlb_vm_ops
> were hooked up to the shared policy mechanism, but they are not.  So,
> there is no way to set the shared policy on such segments.  I tried a
> simple patch to add the shmem_{get|set}_policy() functions to the
> hugetlb_vm_ops, and this almost works.  However, a cat
> of /proc/<pid>/numa_maps hangs.  Haven't investigated the hang, yet.

<snip>

Interesting -- I'll see if I can take a look at this sometime soon. Do
you have a version (working or not) of your patch that you'd be
willing to share?
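
For reference, I'm picturing the hookup looking roughly like the sketch
below -- just a guess on my part, not your patch: it assumes the
CONFIG_NUMA set_policy/get_policy members of vm_operations_struct, and
since shmem_set_policy()/shmem_get_policy() are static to mm/shmem.c, a
real patch would also need to export them (or grow hugetlbfs-specific
equivalents):

/* mm/hugetlb.c -- hypothetical, untested */
struct vm_operations_struct hugetlb_vm_ops = {
	/* ... existing entries unchanged ... */
#ifdef CONFIG_NUMA
	.set_policy	= shmem_set_policy,
	.get_policy	= shmem_get_policy,
#endif
};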

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4
  2007-05-16 22:17         ` [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4 Lee Schermerhorn
@ 2007-05-18  0:30           ` Nish Aravamudan
  2007-05-21 14:57             ` Lee Schermerhorn
  0 siblings, 1 reply; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-18  0:30 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, andyw

On 5/16/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-05-16 at 12:59 -0700, Nish Aravamudan wrote:
>
> <snip>
> >
> > This completely breaks hugepage allocation on a 4-node x86_64 box I have
> > here. Each node has <4GB of memory, so all memory is in ZONE_DMA and
> > ZONE_DMA32. gfp_zone(GFP_HIGHUSER) is ZONE_NORMAL, though, so no node
> > ever gets marked populated and the nodemask keeps its default empty
> > initialization.
> >
> > Thanks to Andy Whitcroft for helping me debug this.
> >
> > I'm not sure how to fix this -- but I ran into it while trying to base my
> > sysfs hugepage allocation patches on top of yours.
>
> OK.  Try this.  Tested OK on 4 node [+ 1 pseudo node] ia64 and 2 node
> x86_64.  The x86_64 had 2G per node--all DMA32.
>
> Notes:
>
> 1) applies on 2.6.22-rc1-mm1 atop my earlier patch to add the call to
> check_highest_zone() to build_zonelists_in_zone_order().  I think it'll
> apply [with offsets] w/o that patch.

Could you give both patches (or just this one) against 2.6.22-rc1 or
current -linus? -mm1 has build issues on ppc64 and i386 (as reported
by Andy and Mel in other threads).

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4
  2007-05-18  0:30           ` Nish Aravamudan
@ 2007-05-21 14:57             ` Lee Schermerhorn
  2007-05-21 17:51               ` Nish Aravamudan
  0 siblings, 1 reply; 27+ messages in thread
From: Lee Schermerhorn @ 2007-05-21 14:57 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, andyw

On Thu, 2007-05-17 at 17:30 -0700, Nish Aravamudan wrote:
> On 5/16/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Wed, 2007-05-16 at 12:59 -0700, Nish Aravamudan wrote:
> >
> > <snip>
> > >
> > > This completely breaks hugepage allocation on a 4-node x86_64 box I have
> > > here. Each node has <4GB of memory, so all memory is in ZONE_DMA and
> > > ZONE_DMA32. gfp_zone(GFP_HIGHUSER) is ZONE_NORMAL, though, so no node
> > > ever gets marked populated and the nodemask keeps its default empty
> > > initialization.
> > >
> > > Thanks to Andy Whitcroft for helping me debug this.
> > >
> > > I'm not sure how to fix this -- but I ran into it while trying to base my
> > > sysfs hugepage allocation patches on top of yours.
> >
> > OK.  Try this.  Tested OK on 4 node [+ 1 pseudo node] ia64 and 2 node
> > x86_64.  The x86_64 had 2G per node--all DMA32.
> >
> > Notes:
> >
> > 1) applies on 2.6.22-rc1-mm1 atop my earlier patch to add the call to
> > check_highest_zone() to build_zonelists_in_zone_order().  I think it'll
> > apply [with offsets] w/o that patch.
> 
> Could you give both patches (or just this one) against 2.6.22-rc1 or
> current -linus? -mm1 has build issues on ppc64 and i386 (as reported
> by Andy and Mel in other threads).

Nish:

Here's the hugepage fix against 2.6.22-rc2 for your testing.  I think
the patch still needs to cook in -mm if you think it's OK.

Lee

[PATCH 2.6.22-rc2] Fix hugetlb pool allocation with empty nodes V4a

Temporary patch against 2.6.22-rc2 for Nish's testing.  

description and signoffs intentionally omitted


 include/linux/nodemask.h |   10 ++++++++++
 mm/hugetlb.c             |   21 +++++++++++++++------
 mm/page_alloc.c          |   29 +++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 6 deletions(-)

Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c	2007-05-21 09:22:53.000000000 -0400
+++ Linux/mm/hugetlb.c	2007-05-21 10:00:45.000000000 -0400
@@ -101,13 +101,22 @@ static void free_huge_page(struct page *
 
 static int alloc_fresh_huge_page(void)
 {
-	static int nid = 0;
+	static int nid = -1;
 	struct page *page;
-	page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
-	nid = next_node(nid, node_online_map);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
+	int start_nid;
+
+	if (nid < 0)
+		nid = first_node(node_populated_map);
+	start_nid = nid;
+
+	do {
+		page = alloc_pages_node(nid,
+				GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
+  				HUGETLB_PAGE_ORDER);
+		nid = next_node(nid, node_populated_map);
+		if (nid >= nr_node_ids)
+			nid = first_node(node_populated_map);
+	} while (!page && nid != start_nid);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-04-25 23:08:32.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-05-21 09:59:16.000000000 -0400
@@ -64,12 +64,16 @@
  *
  * int node_online(node)		Is some node online?
  * int node_possible(node)		Is some node possible?
+ * int node_populated(node)		Is some node populated [at policy_zone]
  *
  * int any_online_node(mask)		First online node in mask
  *
  * node_set_online(node)		set bit 'node' in node_online_map
  * node_set_offline(node)		clear bit 'node' in node_online_map
  *
+ * node_set_populated(node)		set bit 'node' in node_populated_map
+ * node_not_populated(node)		clear bit 'node' in node_populated_map
+ *
  * for_each_node(node)			for-loop node over node_possible_map
  * for_each_online_node(node)		for-loop node over node_online_map
  *
@@ -344,12 +348,14 @@ static inline void __nodes_remap(nodemas
 
 extern nodemask_t node_online_map;
 extern nodemask_t node_possible_map;
+extern nodemask_t node_populated_map;
 
 #if MAX_NUMNODES > 1
 #define num_online_nodes()	nodes_weight(node_online_map)
 #define num_possible_nodes()	nodes_weight(node_possible_map)
 #define node_online(node)	node_isset((node), node_online_map)
 #define node_possible(node)	node_isset((node), node_possible_map)
+#define node_populated(node)	node_isset((node), node_populated_map)
 #define first_online_node	first_node(node_online_map)
 #define next_online_node(nid)	next_node((nid), node_online_map)
 extern int nr_node_ids;
@@ -358,6 +364,7 @@ extern int nr_node_ids;
 #define num_possible_nodes()	1
 #define node_online(node)	((node) == 0)
 #define node_possible(node)	((node) == 0)
+#define node_populated(node)	((node) == 0)
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
@@ -375,6 +382,9 @@ extern int nr_node_ids;
 #define node_set_online(node)	   set_bit((node), node_online_map.bits)
 #define node_set_offline(node)	   clear_bit((node), node_online_map.bits)
 
+#define node_set_populated(node)   set_bit((node), node_populated_map.bits)
+#define node_not_populated(node)   clear_bit((node), node_populated_map.bits)
+
 #define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
 #define for_each_online_node(node) for_each_node_mask((node), node_online_map)
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-05-21 09:22:53.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-05-21 10:03:17.000000000 -0400
@@ -54,6 +54,9 @@ nodemask_t node_online_map __read_mostly
 EXPORT_SYMBOL(node_online_map);
 nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
 EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_populated_map __read_mostly = NODE_MASK_NONE;
+EXPORT_SYMBOL(node_populated_map);
+
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
@@ -1719,6 +1722,22 @@ static int __meminit find_next_best_node
 	return best_node;
 }
 
+/*
+ * setup_populated_map() - record nodes whose "policy_zone" is "on-node".
+ */
+static void setup_populated_map(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	struct zonelist *zl = pgdat->node_zonelists + policy_zone;
+	struct zone *z = zl->zones[0];
+
+	VM_BUG_ON(!z);
+	if (z->zone_pgdat == pgdat)
+		node_set_populated(nid);
+	else
+		node_not_populated(nid);
+}
+
 static void __meminit build_zonelists(pg_data_t *pgdat)
 {
 	int j, node, local_node;
@@ -1788,6 +1807,15 @@ static void __meminit build_zonelist_cac
 
 #else	/* CONFIG_NUMA */
 
+/*
+ * setup_populated_map - non-NUMA case
+ * Only node 0 should be on-line, and it MUST be populated!
+ */
+static void setup_populated_map(int nid)
+{
+	node_set_populated(nid);
+}
+
 static void __meminit build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -1842,6 +1870,7 @@ static int __meminit __build_all_zonelis
 	for_each_online_node(nid) {
 		build_zonelists(NODE_DATA(nid));
 		build_zonelist_cache(NODE_DATA(nid));
+		setup_populated_map(nid);
 	}
 	return 0;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4
  2007-05-21 14:57             ` Lee Schermerhorn
@ 2007-05-21 17:51               ` Nish Aravamudan
  0 siblings, 0 replies; 27+ messages in thread
From: Nish Aravamudan @ 2007-05-21 17:51 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Anton Blanchard, linux-mm, ak, mel, apw,
	Andrew Morton, Eric Whitney, andyw

On 5/21/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Thu, 2007-05-17 at 17:30 -0700, Nish Aravamudan wrote:
> > On 5/16/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > On Wed, 2007-05-16 at 12:59 -0700, Nish Aravamudan wrote:
> > >
> > > <snip>
> > > >
> > > > This completely breaks hugepage allocation on a 4-node x86_64 box I have
> > > > here. Each node has <4GB of memory, so all memory is in ZONE_DMA and
> > > > ZONE_DMA32. gfp_zone(GFP_HIGHUSER) is ZONE_NORMAL, though, so no node
> > > > ever gets marked populated and the nodemask keeps its default empty
> > > > initialization.
> > > >
> > > > Thanks to Andy Whitcroft for helping me debug this.
> > > >
> > > > I'm not sure how to fix this -- but I ran into it while trying to base my
> > > > sysfs hugepage allocation patches on top of yours.
> > >
> > > OK.  Try this.  Tested OK on 4 node [+ 1 pseudo node] ia64 and 2 node
> > > x86_64.  The x86_64 had 2G per node--all DMA32.
> > >
> > > Notes:
> > >
> > > 1) applies on 2.6.22-rc1-mm1 atop my earlier patch to add the call to
> > > check_highest_zone() to build_zonelists_in_zone_order().  I think it'll
> > > apply [with offsets] w/o that patch.
> >
> > Could you give both patches (or just this one) against 2.6.22-rc1 or
> > current -linus? -mm1 has build issues on ppc64 and i386 (as reported
> > by Andy and Mel in other threads).
>
> Nish:
>
> Here's the hugepage fix against 2.6.22-rc2 for your testing.  I think
> the patch still needs to cook in -mm if you think it's OK.

Ok, will test against:

1-node (non-NUMA), 2-node x86
4-node x86_64
4-node ppc64

Will let you know what happens as soon as they all finish.

I definitely agree that the patch should stew in -mm. Presuming the
testing goes well, I can rebase my per-node allocation interface on
top (no comments there yet, though).

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2007-05-21 17:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-03  2:21 [PATCH] Fix hugetlb pool allocation with empty nodes Anton Blanchard
2007-05-03  3:02 ` Christoph Lameter
2007-05-03  6:07   ` Anton Blanchard
2007-05-03  6:37     ` Christoph Lameter
2007-05-03  8:59 ` Andi Kleen
2007-05-03 13:22   ` Anton Blanchard
2007-05-04 20:29 ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 Lee Schermerhorn
2007-05-04 21:27   ` Christoph Lameter
2007-05-04 22:39     ` Nish Aravamudan
2007-05-07 13:40     ` Lee Schermerhorn
2007-05-09 16:37     ` [PATCH] Fix hugetlb pool allocation with empty nodes - V2 -> V3 Lee Schermerhorn
2007-05-09 16:57       ` Christoph Lameter
2007-05-09 19:17         ` Lee Schermerhorn
2007-05-16 17:27           ` Nish Aravamudan
2007-05-16 20:01             ` Lee Schermerhorn
2007-05-09 19:59       ` Nish Aravamudan
2007-05-09 20:37         ` Lee Schermerhorn
2007-05-09 20:54           ` Christoph Lameter
2007-05-09 22:34           ` Nish Aravamudan
2007-05-15 16:30             ` Lee Schermerhorn
2007-05-16 23:47               ` Nish Aravamudan
2007-05-16 19:59       ` Nish Aravamudan
2007-05-16 20:32         ` Lee Schermerhorn
2007-05-16 22:17         ` [PATCH/RFC] Fix hugetlb pool allocation with empty nodes - V4 Lee Schermerhorn
2007-05-18  0:30           ` Nish Aravamudan
2007-05-21 14:57             ` Lee Schermerhorn
2007-05-21 17:51               ` Nish Aravamudan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).