All of lore.kernel.org
 help / color / mirror / Atom feed
* NUMA policy issues with ZONE_MOVABLE
@ 2007-07-25  4:20 Christoph Lameter
  2007-07-25  4:47 ` Nick Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Christoph Lameter @ 2007-07-25  4:20 UTC (permalink / raw)
  To: linux-mm; +Cc: Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

The outcome of the 2.6.23 merge was surprising. No antifrag but only 
ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.

For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated

1. It is the highest zone.

2. Thus policy_zone == ZONE_MOVABLE

ZONE_MOVABLE contains only movable allocs by default. That is anonymous 
pages and page cache pages?

The NUMA layer only supports NUMA policies for the highest zone. 
Thus NUMA policies can control anonymous pages and the page cache pages 
allocated from ZONE_MOVABLE. 

However, NUMA policies will no longer affect non pagecache and non 
anonymous allocations. So policies can no longer redirect slab allocations 
and huge page allocations (unless huge page allocations are moved to 
ZONE_MOVABLE). And there are likely other allocations that are not 
movable.

If ZONE_MOVABLE is off then things should be working as normal.

Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?


The mobility approach used subcategories of a zone which would have 
allowed the application of memory policies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  4:20 NUMA policy issues with ZONE_MOVABLE Christoph Lameter
@ 2007-07-25  4:47 ` Nick Piggin
  2007-07-25  5:05   ` Christoph Lameter
  2007-07-25  6:36 ` KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 60+ messages in thread
From: Nick Piggin @ 2007-07-25  4:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

Christoph Lameter wrote:
> The outcome of the 2.6.23 merge was surprising. No antifrag but only 
> ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.

ZONE_MOVABLE is the way to be able to guarantee contiguous memory
for hotplug and hugetlb without wasting too much memory, and is
very unintrusive for what it does. I think it was a good step
forward.

There is still disagreement about the antifrag patches, so what
is surprising about this outcome?


> For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated
> 
> 1. It is the highest zone.
> 
> 2. Thus policy_zone == ZONE_MOVABLE
> 
> ZONE_MOVABLE contains only movable allocs by default. That is anonymous 
> pages and page cache pages?
> 
> The NUMA layer only supports NUMA policies for the highest zone. 
> Thus NUMA policies can control anonymous pages and the page cache pages 
> allocated from ZONE_MOVABLE. 
> 
> However, NUMA policies will no longer affect non pagecache and non 
> anonymous allocations. So policies can no longer redirect slab allocations 
> and huge page allocations (unless huge page allocations are moved to 
> ZONE_MOVABLE). And there are likely other allocations that are not 
> movable.
> 
> If ZONE_MOVABLE is off then things should be working as normal.
> 
> Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?

I guess it has similar problems as ZONE_HIGHMEM etc. I think the
zoned allocator and NUMA was there first, so it might be more
correct to say that mempolicies are incompatible with them :)

But I thought you had plans to fix mempolicies to do zones better?
What happened to that?

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  4:47 ` Nick Piggin
@ 2007-07-25  5:05   ` Christoph Lameter
  2007-07-25  5:24     ` Nick Piggin
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-25  5:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

On Wed, 25 Jul 2007, Nick Piggin wrote:

> > Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?
> 
> I guess it has similar problems as ZONE_HIGHMEM etc. I think the
> zoned allocator and NUMA was there first, so it might be more
> correct to say that mempolicies are incompatible with them :)

Highmem is only used on i386 NUMA and works fine on NUMAQ. The current 
zone types are carefully fitted to existing NUMA systems.
 
> But I thought you had plans to fix mempolicies to do zones better?

No sure where you got that from. I repeatedly suggested that more zones be 
removed because of this one and other issues.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  5:05   ` Christoph Lameter
@ 2007-07-25  5:24     ` Nick Piggin
  2007-07-25  6:00       ` Christoph Lameter
  2007-07-25  9:32       ` Andi Kleen
  0 siblings, 2 replies; 60+ messages in thread
From: Nick Piggin @ 2007-07-25  5:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

Christoph Lameter wrote:
> On Wed, 25 Jul 2007, Nick Piggin wrote:
> 
> 
>>>Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?
>>
>>I guess it has similar problems as ZONE_HIGHMEM etc. I think the
>>zoned allocator and NUMA was there first, so it might be more
>>correct to say that mempolicies are incompatible with them :)
> 
> 
> Highmem is only used on i386 NUMA and works fine on NUMAQ. The current 
> zone types are carefully fitted to existing NUMA systems.

I don't understand what you mean. Aren't mempolicies also supposed to
work on NUMAQ too? How about DMA and DMA32 allocations?


>>But I thought you had plans to fix mempolicies to do zones better?
> 
> 
> No sure where you got that from. I repeatedly suggested that more zones be 
> removed because of this one and other issues.

Oh I must have been mistaken.

Well I guess you haven't succeeded in getting zones removed, so I think
we should make mempolicies work better with zones.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  5:24     ` Nick Piggin
@ 2007-07-25  6:00       ` Christoph Lameter
  2007-07-25  6:09         ` Nick Piggin
  2007-07-25  9:32       ` Andi Kleen
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-25  6:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

On Wed, 25 Jul 2007, Nick Piggin wrote:

> > Highmem is only used on i386 NUMA and works fine on NUMAQ. The current zone
> > types are carefully fitted to existing NUMA systems.
> 
> I don't understand what you mean. Aren't mempolicies also supposed to
> work on NUMAQ too? How about DMA and DMA32 allocations?

Memory policies work on NUMAQ. Please read up on memory policies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  6:00       ` Christoph Lameter
@ 2007-07-25  6:09         ` Nick Piggin
  0 siblings, 0 replies; 60+ messages in thread
From: Nick Piggin @ 2007-07-25  6:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

Christoph Lameter wrote:
> On Wed, 25 Jul 2007, Nick Piggin wrote:
> 
> 
>>>Highmem is only used on i386 NUMA and works fine on NUMAQ. The current zone
>>>types are carefully fitted to existing NUMA systems.
>>
>>I don't understand what you mean. Aren't mempolicies also supposed to
>>work on NUMAQ too? How about DMA and DMA32 allocations?
> 
> 
> Memory policies work on NUMAQ. Please read up on memory policies.

Because the first 1GB will be on one node? Ok, maybe that happens
to work in an ugly sort of way. How about DMA32 then?

Do you disagree that mempolices should be made to work better with
multiple zones?

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  4:20 NUMA policy issues with ZONE_MOVABLE Christoph Lameter
  2007-07-25  4:47 ` Nick Piggin
@ 2007-07-25  6:36 ` KAMEZAWA Hiroyuki
  2007-07-25 11:16 ` Mel Gorman
  2007-07-25 14:27 ` Lee Schermerhorn
  3 siblings, 0 replies; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-25  6:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Lee Schermerhorn, ak, Mel Gorman, akpm

On Tue, 24 Jul 2007 21:20:45 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> The outcome of the 2.6.23 merge was surprising. No antifrag but only 
> ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.
> 
> For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated
> 
> 1. It is the highest zone.
> 
> 2. Thus policy_zone == ZONE_MOVABLE
> 

I'm sorry that I'm not familiar with mempolicy's history. Can I make questions ? 

What was the main purpose of the policy_zone ?

mempolicy can work without policy_zone check ?


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  5:24     ` Nick Piggin
  2007-07-25  6:00       ` Christoph Lameter
@ 2007-07-25  9:32       ` Andi Kleen
  1 sibling, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2007-07-25  9:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, linux-mm, Lee Schermerhorn, KAMEZAWA Hiroyuki,
	Mel Gorman, akpm

On Wednesday 25 July 2007 07:24:05 Nick Piggin wrote:

> I don't understand what you mean. Aren't mempolicies also supposed to
> work on NUMAQ too? How about DMA and DMA32 allocations?

bind mempolicies only support one zone, always the highest. This means on numaq
only highmem is policied.

DMA/DMA32 is not policied for obvious reasons (they often don't exist on
all nodes) 

> Well I guess you haven't succeeded in getting zones removed, so I think
> we should make mempolicies work better with zones.

Why? That would just complicate everything. In particular it would mean
you would need multiple fallback lists per VMA, which would increase
the memory usage significantly.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  4:20 NUMA policy issues with ZONE_MOVABLE Christoph Lameter
  2007-07-25  4:47 ` Nick Piggin
  2007-07-25  6:36 ` KAMEZAWA Hiroyuki
@ 2007-07-25 11:16 ` Mel Gorman
  2007-07-25 14:30   ` Lee Schermerhorn
  2007-07-25 19:31   ` Christoph Lameter
  2007-07-25 14:27 ` Lee Schermerhorn
  3 siblings, 2 replies; 60+ messages in thread
From: Mel Gorman @ 2007-07-25 11:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm

On (24/07/07 21:20), Christoph Lameter didst pronounce:
> The outcome of the 2.6.23 merge was surprising. No antifrag but only 
> ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.
> 
> For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated
> 
> 1. It is the highest zone.
> 
> 2. Thus policy_zone == ZONE_MOVABLE
> 
> ZONE_MOVABLE contains only movable allocs by default. That is anonymous 
> pages and page cache pages?
> 
> The NUMA layer only supports NUMA policies for the highest zone. 
> Thus NUMA policies can control anonymous pages and the page cache pages 
> allocated from ZONE_MOVABLE. 
> 
> However, NUMA policies will no longer affect non pagecache and non 
> anonymous allocations. So policies can no longer redirect slab allocations 
> and huge page allocations (unless huge page allocations are moved to 
> ZONE_MOVABLE). And there are likely other allocations that are not 
> movable.
> 
> If ZONE_MOVABLE is off then things should be working as normal.
> 
> Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?
>  

No but it has to be dealt with. I would have preferred this was highlighted
earlier but there is a candidate fix below.  It appears to be the minimum
solution to allow policies to work as they do today but remaining compatible
with ZONE_MOVABLE. It works by

o check_highest_zone will be the highest populated zone that is not ZONE_MOVEABLE
o bind_zonelist builds a zonelist of all populated zones, not policy_zone and lower
o The page allocator checks what the highest usable zone is and ignores
  zones in the zonelist that should not be used

This allows some other interesting possibilities

o We could have just one zonelist per node if the page allocator will
  skip over unsuitable zones for the gfp_mask. That would save memory
o We could get rid of policy_zone altogether.

On the second point here, policy_zone and how it is used is a bit
mad. Particularly, its behaviour on machines with multiple zones is a
little unpredictable with cross-platform applications potentially behaving
different on IA64 than x86_64 for example.  However, a test patch that would
delete it looked as if it would break NUMAQ if a process was bound to nodes
2 and 3 but not 0 for example because slab allocations would fail. Similar,
it would have consequences on x86_64 with NORMAL and DMA32.

Here is the patch just to handle policies with ZONE_MOVABLE. The highest
zone still gets treated as it does today but allocations using ZONE_MOVABLE
will still be policied. It has been boot-tested and a basic compile job run
on a x86_64 NUMA machine (elm3b6 on test.kernel.org). Is there a
standard test for regression testing policies?

Comments?

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e147cf5..5bdd656 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -166,7 +166,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone)
+	if (k > policy_zone && k != ZONE_MOVABLE)
 		policy_zone = k;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..e798be5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,7 +144,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
 	   lower zones etc. Avoid empty zones because the memory allocator
 	   doesn't like them. If you implement node hot removal you
 	   have to fix that. */
-	k = policy_zone;
+	k = MAX_NR_ZONES - 1;
 	while (1) {
 		for_each_node_mask(nd, *nodes) { 
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40954fb..22485d5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	enum zone_type highest_zoneidx;
 
 zonelist_scan:
 	/*
@@ -1165,10 +1166,23 @@ zonelist_scan:
 	 */
 	z = zonelist->zones;
 
+	/* For memory policies, get the highest allowed zone by the flags */
+	if (NUMA_BUILD)
+		highest_zoneidx = gfp_zone(gfp_mask);
+
 	do {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
+
+		/*
+		 * In NUMA, this could be a policy zonelist which contains
+		 * zones that may not be allowed by the current gfp_mask.
+		 * Check the zone is allowed by the current flags
+		 */
+		if (NUMA_BUILD && zone_idx(*z) > highest_zoneidx)
+			continue;
+
 		zone = *z;
 		if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
 			zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25  4:20 NUMA policy issues with ZONE_MOVABLE Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-07-25 11:16 ` Mel Gorman
@ 2007-07-25 14:27 ` Lee Schermerhorn
  2007-07-25 17:39   ` Mel Gorman
  3 siblings, 1 reply; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-25 14:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, ak, KAMEZAWA Hiroyuki, Mel Gorman, akpm

On Tue, 2007-07-24 at 21:20 -0700, Christoph Lameter wrote:
> The outcome of the 2.6.23 merge was surprising. No antifrag but only 
> ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.
> 
> For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated
> 
> 1. It is the highest zone.
> 
> 2. Thus policy_zone == ZONE_MOVABLE
> 
> ZONE_MOVABLE contains only movable allocs by default. That is anonymous 
> pages and page cache pages?
> 
> The NUMA layer only supports NUMA policies for the highest zone. 
> Thus NUMA policies can control anonymous pages and the page cache pages 
> allocated from ZONE_MOVABLE. 
> 
> However, NUMA policies will no longer affect non pagecache and non 
> anonymous allocations. So policies can no longer redirect slab allocations 
> and huge page allocations (unless huge page allocations are moved to 
> ZONE_MOVABLE). And there are likely other allocations that are not 
> movable.
> 
> If ZONE_MOVABLE is off then things should be working as normal.
> 
> Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?
> 
> 
> The mobility approach used subcategories of a zone which would have 
> allowed the application of memory policies.

Isn't ZONE_MOVABLE always a subset of the memory in the highest "real"
zone--the one that WOULD be policy_zone if ZONE_MOVABLE weren't
configured?  If so, perhaps we could just not assign ZONE_MOVABLE to
policy_zone in check_highest zone.  We already check for >= or <
policy_zone where it's checked [zonelist_policy() and vma_migratable()],
so ZONE_MOVABLE will get a free pass if we clip policy_zone at the
highest !MOVABLE zone.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25 11:16 ` Mel Gorman
@ 2007-07-25 14:30   ` Lee Schermerhorn
  2007-07-25 19:31   ` Christoph Lameter
  1 sibling, 0 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-25 14:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm

On Wed, 2007-07-25 at 12:16 +0100, Mel Gorman wrote:
> On (24/07/07 21:20), Christoph Lameter didst pronounce:
> > The outcome of the 2.6.23 merge was surprising. No antifrag but only 
> > ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.
> > 
> > For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated
> > 
> > 1. It is the highest zone.
> > 
> > 2. Thus policy_zone == ZONE_MOVABLE
> > 
> > ZONE_MOVABLE contains only movable allocs by default. That is anonymous 
> > pages and page cache pages?
> > 
> > The NUMA layer only supports NUMA policies for the highest zone. 
> > Thus NUMA policies can control anonymous pages and the page cache pages 
> > allocated from ZONE_MOVABLE. 
> > 
> > However, NUMA policies will no longer affect non pagecache and non 
> > anonymous allocations. So policies can no longer redirect slab allocations 
> > and huge page allocations (unless huge page allocations are moved to 
> > ZONE_MOVABLE). And there are likely other allocations that are not 
> > movable.
> > 
> > If ZONE_MOVABLE is off then things should be working as normal.
> > 
> > Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?
> >  
> 
> No but it has to be dealt with. I would have preferred this was highlighted
> earlier but there is a candidate fix below.  It appears to be the minimum
> solution to allow policies to work as they do today but remaining compatible
> with ZONE_MOVABLE. It works by
> 
> o check_highest_zone will be the highest populated zone that is not ZONE_MOVEABLE

Ah, sick minds think alike...  ;-)

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25 14:27 ` Lee Schermerhorn
@ 2007-07-25 17:39   ` Mel Gorman
  0 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2007-07-25 17:39 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm

On (25/07/07 10:27), Lee Schermerhorn didst pronounce:
> On Tue, 2007-07-24 at 21:20 -0700, Christoph Lameter wrote:
> > The outcome of the 2.6.23 merge was surprising. No antifrag but only 
> > ZONE_MOVABLE. ZONE_MOVABLE is the highest zone.
> > 
> > For the NUMA layer this has some weird consequences if ZONE_MOVABLE is populated
> > 
> > 1. It is the highest zone.
> > 
> > 2. Thus policy_zone == ZONE_MOVABLE
> > 
> > ZONE_MOVABLE contains only movable allocs by default. That is anonymous 
> > pages and page cache pages?
> > 
> > The NUMA layer only supports NUMA policies for the highest zone. 
> > Thus NUMA policies can control anonymous pages and the page cache pages 
> > allocated from ZONE_MOVABLE. 
> > 
> > However, NUMA policies will no longer affect non pagecache and non 
> > anonymous allocations. So policies can no longer redirect slab allocations 
> > and huge page allocations (unless huge page allocations are moved to 
> > ZONE_MOVABLE). And there are likely other allocations that are not 
> > movable.
> > 
> > If ZONE_MOVABLE is off then things should be working as normal.
> > 
> > Doesnt this mean that ZONE_MOVABLE is incompatible with CONFIG_NUMA?
> > 
> > 
> > The mobility approach used subcategories of a zone which would have 
> > allowed the application of memory policies.
> 
> Isn't ZONE_MOVABLE always a subset of the memory in the highest "real"
> zone--the one that WOULD be policy_zone if ZONE_MOVABLE weren't
> configured? 

Yes, it is always the case because the selected zone is always the same
zone as policy_zone.

> If so, perhaps we could just not assign ZONE_MOVABLE to
> policy_zone in check_highest zone. 

Yep.

> We already check for >= or <
> policy_zone where it's checked [zonelist_policy() and vma_migratable()],
> so ZONE_MOVABLE will get a free pass if we clip policy_zone at the
> highest !MOVABLE zone.
> 

Fully agreed on all counts. I'm pleased that this is pretty much
identical to what I have in the patch.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25 11:16 ` Mel Gorman
  2007-07-25 14:30   ` Lee Schermerhorn
@ 2007-07-25 19:31   ` Christoph Lameter
  2007-07-26  4:15     ` KAMEZAWA Hiroyuki
                       ` (2 more replies)
  1 sibling, 3 replies; 60+ messages in thread
From: Christoph Lameter @ 2007-07-25 19:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Wed, 25 Jul 2007, Mel Gorman wrote:

> o check_highest_zone will be the highest populated zone that is not ZONE_MOVEABLE
> o bind_zonelist builds a zonelist of all populated zones, not policy_zone and lower
> o The page allocator checks what the highest usable zone is and ignores
>   zones in the zonelist that should not be used

Which is a performance impact that we would rather avoid since we are now 
filtering zonelists on every allocation. But we have other issues as well 
that would be fixed by this approach.

How about changing __alloc_pages to lookup the zonelist on its own based 
on a node parameter and a set of allowed nodes? That may significantly 
clean up the memory policy layer and the cpuset layer. But it will 
increase the effort to scan zonelists on each allocation. A large system 
with 1024 nodes may have more than 1024 zones on each nodelist!

> On the second point here, policy_zone and how it is used is a bit
> mad. Particularly, its behaviour on machines with multiple zones is a
> little unpredictable with cross-platform applications potentially behaving
> different on IA64 than x86_64 for example.  However, a test patch that would
> delete it looked as if it would break NUMAQ if a process was bound to nodes
> 2 and 3 but not 0 for example because slab allocations would fail. Similar,
> it would have consequences on x86_64 with NORMAL and DMA32.

Nope it would not fail. NUMAQ has policy_zone == HIGHMEM and slab 
allocations do not use highmem. Memory policies are not applied to slab 
allocs on NUMAQ. Thus slab allocations will use node 0 even 
if you restrict allocs to node 2 and 3.

> Here is the patch just to handle policies with ZONE_MOVABLE. The highest
> zone still gets treated as it does today but allocations using ZONE_MOVABLE
> will still be policied. It has been boot-tested and a basic compile job run
> on a x86_64 NUMA machine (elm3b6 on test.kernel.org). Is there a
> standard test for regression testing policies?

There is a test in the numactl package by Andi Kleen.

> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index e147cf5..5bdd656 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -166,7 +166,7 @@ extern enum zone_type policy_zone;
>  
>  static inline void check_highest_zone(enum zone_type k)
>  {
> -	if (k > policy_zone)
> +	if (k > policy_zone && k != ZONE_MOVABLE)
>  		policy_zone = k;
>  }

That actually cleans up stuff...

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 71b84b4..e798be5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -149,7 +144,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
>  	   lower zones etc. Avoid empty zones because the memory allocator
>  	   doesn't like them. If you implement node hot removal you
>  	   have to fix that. */
> -	k = policy_zone;
> +	k = MAX_NR_ZONES - 1;

k = ZONE_MOVABLE?

>  	while (1) {
>  		for_each_node_mask(nd, *nodes) { 
>  			struct zone *z = &NODE_DATA(nd)->node_zones[k];

So bind zonelists now include two zones per node: The origin of 
ZONE_MOVABLE and ZONE_MOVABLE.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 40954fb..22485d5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
>  	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
>  	int zlc_active = 0;		/* set if using zonelist_cache */
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> +	enum zone_type highest_zoneidx;
>  
>  zonelist_scan:
>  	/*
> @@ -1165,10 +1166,23 @@ zonelist_scan:
>  	 */
>  	z = zonelist->zones;
>  
> +	/* For memory policies, get the highest allowed zone by the flags */
> +	if (NUMA_BUILD)
> +		highest_zoneidx = gfp_zone(gfp_mask);
> +
>  	do {
>  		if (NUMA_BUILD && zlc_active &&
>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
> +
> +		/*
> +		 * In NUMA, this could be a policy zonelist which contains
> +		 * zones that may not be allowed by the current gfp_mask.
> +		 * Check the zone is allowed by the current flags
> +		 */
> +		if (NUMA_BUILD && zone_idx(*z) > highest_zoneidx)
> +			continue;
> +

Skip the zones that are higher?

So for a __GFP_MOVABLE alloc we would scan all zones and for 
policy_zone just the policy zone.

Lee should probably also review this in detail since he has recent 
experience fiddling around with memory policies. Paul has also 
experience in this area.

Maybe this can actually  help to deal with some of the corner cases of 
memory policies (just hope the performance impact is not significant).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25 19:31   ` Christoph Lameter
@ 2007-07-26  4:15     ` KAMEZAWA Hiroyuki
  2007-07-26  4:53       ` Christoph Lameter
  2007-07-26 16:16       ` Mel Gorman
  2007-07-26 13:23     ` Mel Gorman
  2007-08-02 14:09     ` Mel Gorman
  2 siblings, 2 replies; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-26  4:15 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Mel Gorman, linux-mm, Lee Schermerhorn, ak, akpm, pj

On Wed, 25 Jul 2007 12:31:21 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> So for a __GFP_MOVABLE alloc we would scan all zones and for 
> policy_zone just the policy zone.
> 
> Lee should probably also review this in detail since he has recent 
> experience fiddling around with memory policies. Paul has also 
> experience in this area.
> 
> Maybe this can actually  help to deal with some of the corner cases of 
> memory policies (just hope the performance impact is not significant).
> 
> 
Hmm,  How about following patch ? (not tested, just an idea).
I'm sorry if I misunderstand concept ot policy_zone.

==
Index: linux-2.6.23-rc1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.23-rc1.orig/include/linux/mempolicy.h
+++ linux-2.6.23-rc1/include/linux/mempolicy.h
@@ -162,14 +162,11 @@ extern struct zonelist *huge_zonelist(st
 		unsigned long addr, gfp_t gfp_flags);
 extern unsigned slab_node(struct mempolicy *policy);
 
+/*
+ * The smalles zone_idx which all nodes can offer against GFP_xxx
+ */
 extern enum zone_type policy_zone;
 
-static inline void check_highest_zone(enum zone_type k)
-{
-	if (k > policy_zone)
-		policy_zone = k;
-}
-
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
Index: linux-2.6.23-rc1/mm/page_alloc.c
===================================================================
--- linux-2.6.23-rc1.orig/mm/page_alloc.c
+++ linux-2.6.23-rc1/mm/page_alloc.c
@@ -1648,7 +1648,6 @@ static int build_zonelists_node(pg_data_
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
 			zonelist->zones[nr_zones++] = zone;
-			check_highest_zone(zone_type);
 		}
 
 	} while (zone_type);
@@ -1857,7 +1856,6 @@ static void build_zonelists_in_zone_orde
 				z = &NODE_DATA(node)->node_zones[zone_type];
 				if (populated_zone(z)) {
 					zonelist->zones[pos++] = z;
-					check_highest_zone(zone_type);
 				}
 			}
 		}
@@ -1934,6 +1932,7 @@ static void build_zonelists(pg_data_t *p
 	int local_node, prev_node;
 	struct zonelist *zonelist;
 	int order = current_zonelist_order;
+	int highest_zone;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -1981,6 +1980,32 @@ static void build_zonelists(pg_data_t *p
 		/* calculate node order -- i.e., DMA last! */
 		build_zonelists_in_zone_order(pgdat, j);
 	}
+	/*
+	 * Find the lowest zone where mempolicy (MBID) can work well.
+ 	 */
+	highest_zone = 0;
+	policy_zone = -1;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *first_zone;
+		int success = 1;
+		for_each_node_state(node, N_MEMORY) {
+			first_zone = NODE_DATA(node)->node_zonelists[i][0];
+			if (zone_idx(first_zone) > highest_zone)
+				highest_zone = zone_idx(first_zone);
+			if (first_zone->zone_pgdat != NODE_DATA(node)) {
+				/* This node cannot offer right pages for this
+				   GFP */
+				success = 0;
+				break;
+			}
+		}
+		if (success) {
+			policy_zone = i;
+			break;
+		}
+	}
+	if (policy_zone == -1)
+		policy_zone = highest_zone;
 }
 
 /* Construct the zonelist performance cache - see further mmzone.h */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26  4:15     ` KAMEZAWA Hiroyuki
@ 2007-07-26  4:53       ` Christoph Lameter
  2007-07-26  7:41         ` KAMEZAWA Hiroyuki
  2007-07-26 16:16       ` Mel Gorman
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-26  4:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Mel Gorman, linux-mm, Lee Schermerhorn, ak, akpm, pj

On Thu, 26 Jul 2007, KAMEZAWA Hiroyuki wrote:

> Hmm,  How about following patch ? (not tested, just an idea).
> I'm sorry if I misunderstand concept ot policy_zone.

Maybe we should get rid of policy zone completely? There are only a few 
lower zones on a NUMA machine anyways and if the filtering in 
__alloc_pages does the trick then we could simply generate lists will
all zones in build_bindzonelist.

The main dividing line may be if zones are available on all (memory) 
nodes. If they are only available on a single nodes (like DMA or DMA32) 
then policies must be disregarded if the alloc would otherwise not be 
possible.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26  4:53       ` Christoph Lameter
@ 2007-07-26  7:41         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-26  7:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Mel Gorman, linux-mm, Lee Schermerhorn, ak, akpm, pj

On Wed, 25 Jul 2007 21:53:32 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 26 Jul 2007, KAMEZAWA Hiroyuki wrote:
> 
> > Hmm,  How about following patch ? (not tested, just an idea).
> > I'm sorry if I misunderstand concept ot policy_zone.
> 
> Maybe we should get rid of policy zone completely? There are only a few 
> lower zones on a NUMA machine anyways and if the filtering in 
> __alloc_pages does the trick then we could simply generate lists will
> all zones in build_bindzonelist.
> 
> The main dividing line may be if zones are available on all (memory) 
> nodes. If they are only available on a single nodes (like DMA or DMA32) 
> then policies must be disregarded if the alloc would otherwise not be 
> possible.
> 

IMHO, when using customized zonelists, zonelists[MAX_NR_ZONES] should be
prepared for all gfp_zone(GFP_xxx). But zonlists[] can be very big.

Another thinking, currnet MBIND uses pages from lower nodes, (nodes have lower ids.)
even if the node is far. And all process which uses MBIND have the same tendency.

I'd like to vote for implementing node_mask check in alloc_pages, but doesn't have
good idea to implement it in efficient manner on 1024-nodes server...

like, alloc_page_mask(gftp_t gftp, int order, nodemask_t mask);


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25 19:31   ` Christoph Lameter
  2007-07-26  4:15     ` KAMEZAWA Hiroyuki
@ 2007-07-26 13:23     ` Mel Gorman
  2007-07-26 18:07       ` Christoph Lameter
  2007-07-26 18:09       ` Lee Schermerhorn
  2007-08-02 14:09     ` Mel Gorman
  2 siblings, 2 replies; 60+ messages in thread
From: Mel Gorman @ 2007-07-26 13:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (25/07/07 12:31), Christoph Lameter didst pronounce:
> On Wed, 25 Jul 2007, Mel Gorman wrote:
> 
> > o check_highest_zone will be the highest populated zone that is not ZONE_MOVEABLE
> > o bind_zonelist builds a zonelist of all populated zones, not policy_zone and lower
> > o The page allocator checks what the highest usable zone is and ignores
> >   zones in the zonelist that should not be used
> 
> Which is a performance impact that we would rather avoid since we are now 
> filtering zonelists on every allocation.

Yes, that is true. Only allocations using the MPOL_BIND policy would need
to do this checking so it's be best to only filter zones when necessary.

Prehaps if zonelist had a parameter called should_filter that is only set
for the policy zonelists that need this checking. That would avoid doing
any filter checking for almost all allocations.

So we would have;

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                struct zonelist *zonelist, int alloc_flags)
{
	...
	int should_filter = zonelist->should_filter;

	...

	if (NUMA_BUILD && should_filter)
		highest_zoneidx = gfp_zone(gfp_mask);

	do {
		...
		if (NUMA_BUILD && should_filter && zone_idx(*z) > highest_zoneidx)
			continue;

}

?

That would avoid filtering in normal cases and particular avoiding using
zone_idx() a lot.

> But we have other issues as well 
> that would be fixed by this approach.
> 
> How about changing __alloc_pages to lookup the zonelist on its own based 
> on a node parameter and a set of allowed nodes? That may significantly 
> clean up the memory policy layer and the cpuset layer. But it will 
> increase the effort to scan zonelists on each allocation. A large system 
> with 1024 nodes may have more than 1024 zones on each nodelist!
> 

That sounds like it would require the creation of a zonelist for each
allocation attempt. That is not ideal as there is no place to allocate
the zonelist during __alloc_pages(). It's not like it can call
kmalloc().

> > On the second point here, policy_zone and how it is used is a bit
> > mad. Particularly, its behaviour on machines with multiple zones is a
> > little unpredictable with cross-platform applications potentially behaving
> > different on IA64 than x86_64 for example.  However, a test patch that would
> > delete it looked as if it would break NUMAQ if a process was bound to nodes
> > 2 and 3 but not 0 for example because slab allocations would fail. Similar,
> > it would have consequences on x86_64 with NORMAL and DMA32.
> 
> Nope it would not fail. NUMAQ has policy_zone == HIGHMEM and slab 
> allocations do not use highmem.

It would fail if policy_zone didn't exist, that was my point. Without
policy_zone, we apply policy to all allocations and that causes
problems.

> Memory policies are not applied to slab 
> allocs on NUMAQ. Thus slab allocations will use node 0 even 
> if you restrict allocs to node 2 and 3.
> 

They are not applied because policy_zone is used.

> > Here is the patch just to handle policies with ZONE_MOVABLE. The highest
> > zone still gets treated as it does today but allocations using ZONE_MOVABLE
> > will still be policied. It has been boot-tested and a basic compile job run
> > on a x86_64 NUMA machine (elm3b6 on test.kernel.org). Is there a
> > standard test for regression testing policies?
> 
> There is a test in the numactl package by Andi Kleen.
> 

ok, thanks.

> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index e147cf5..5bdd656 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -166,7 +166,7 @@ extern enum zone_type policy_zone;
> >  
> >  static inline void check_highest_zone(enum zone_type k)
> >  {
> > -	if (k > policy_zone)
> > +	if (k > policy_zone && k != ZONE_MOVABLE)
> >  		policy_zone = k;
> >  }
> 
> That actually cleans up stuff...
> 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 71b84b4..e798be5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -149,7 +144,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
> >  	   lower zones etc. Avoid empty zones because the memory allocator
> >  	   doesn't like them. If you implement node hot removal you
> >  	   have to fix that. */
> > -	k = policy_zone;
> > +	k = MAX_NR_ZONES - 1;
> 
> k = ZONE_MOVABLE?
> 

I would work as k = ZONE_MOVABLE but the intention of the code is to add
all populated zones to the list, not all zones below ZONE_MOVABLE.

> >  	while (1) {
> >  		for_each_node_mask(nd, *nodes) { 
> >  			struct zone *z = &NODE_DATA(nd)->node_zones[k];
> 
> So bind zonelists now include two zones per node: The origin of 
> ZONE_MOVABLE and ZONE_MOVABLE.
> 

Right, hence the filtering later.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 40954fb..22485d5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> >  	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
> >  	int zlc_active = 0;		/* set if using zonelist_cache */
> >  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> > +	enum zone_type highest_zoneidx;
> >  
> >  zonelist_scan:
> >  	/*
> > @@ -1165,10 +1166,23 @@ zonelist_scan:
> >  	 */
> >  	z = zonelist->zones;
> >  
> > +	/* For memory policies, get the highest allowed zone by the flags */
> > +	if (NUMA_BUILD)
> > +		highest_zoneidx = gfp_zone(gfp_mask);
> > +
> >  	do {
> >  		if (NUMA_BUILD && zlc_active &&
> >  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> >  				continue;
> > +
> > +		/*
> > +		 * In NUMA, this could be a policy zonelist which contains
> > +		 * zones that may not be allowed by the current gfp_mask.
> > +		 * Check the zone is allowed by the current flags
> > +		 */
> > +		if (NUMA_BUILD && zone_idx(*z) > highest_zoneidx)
> > +			continue;
> > +
> 
> Skip the zones that are higher?
> 

Yeah, exactly. This has the affect of applying policy to policy_zone
(the highest zone) and ZONE_MOVABLE (which takes it's memory from the
highest zone)

> So for a __GFP_MOVABLE alloc we would scan all zones and for 
> policy_zone just the policy zone.
> 

Exactly.

> Lee should probably also review this in detail since he has recent 
> experience fiddling around with memory policies. Paul has also 
> experience in this area.
> 

Lee had suggested almost the exact same solution but I'd like to hear if
the implementation matches his expectation.

> Maybe this can actually  help to deal with some of the corner cases of 
> memory policies (just hope the performance impact is not significant).

I ran the patch on a wide variety of machines, NUMA and non-NUMA. The
non-NUMA machines showed no differences as you would expect for
kernbench and aim9. On NUMA machines, I saw both small gains and small
regressions. By and large, the performance was the same or within 0.08%
for kernbench which is within noise basically.

It might be more pronounced on larger NUMA machines though, I cannot
generate those figures.

I'll try adding a should_filter to zonelist that is only set for
MPOL_BIND and see what it looks like.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26  4:15     ` KAMEZAWA Hiroyuki
  2007-07-26  4:53       ` Christoph Lameter
@ 2007-07-26 16:16       ` Mel Gorman
  2007-07-26 18:03         ` Christoph Lameter
  1 sibling, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-07-26 16:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, linux-mm, Lee Schermerhorn, ak, akpm, pj

On (26/07/07 13:15), KAMEZAWA Hiroyuki didst pronounce:
> On Wed, 25 Jul 2007 12:31:21 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
> > So for a __GFP_MOVABLE alloc we would scan all zones and for 
> > policy_zone just the policy zone.
> > 
> > Lee should probably also review this in detail since he has recent 
> > experience fiddling around with memory policies. Paul has also 
> > experience in this area.
> > 
> > Maybe this can actually  help to deal with some of the corner cases of 
> > memory policies (just hope the performance impact is not significant).
> > 
> > 
>
> Hmm,  How about following patch ? (not tested, just an idea).
> I'm sorry if I misunderstand concept ot policy_zone.
> 

The following seems like a good idea to do anyway.

> ==
> Index: linux-2.6.23-rc1/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.23-rc1.orig/include/linux/mempolicy.h
> +++ linux-2.6.23-rc1/include/linux/mempolicy.h
> @@ -162,14 +162,11 @@ extern struct zonelist *huge_zonelist(st
>  		unsigned long addr, gfp_t gfp_flags);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
> +/*
> + * The smalles zone_idx which all nodes can offer against GFP_xxx
> + */
>  extern enum zone_type policy_zone;
>  

The comment is a little misleading

/* policy_zone is the lowest zone index that is present on all nodes */

Right?

> -static inline void check_highest_zone(enum zone_type k)
> -{
> -	if (k > policy_zone)
> -		policy_zone = k;
> -}
> -
>  int do_migrate_pages(struct mm_struct *mm,
>  	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
>  
> Index: linux-2.6.23-rc1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.23-rc1.orig/mm/page_alloc.c
> +++ linux-2.6.23-rc1/mm/page_alloc.c
> @@ -1648,7 +1648,6 @@ static int build_zonelists_node(pg_data_
>  		zone = pgdat->node_zones + zone_type;
>  		if (populated_zone(zone)) {
>  			zonelist->zones[nr_zones++] = zone;
> -			check_highest_zone(zone_type);
>  		}
>  
>  	} while (zone_type);
> @@ -1857,7 +1856,6 @@ static void build_zonelists_in_zone_orde
>  				z = &NODE_DATA(node)->node_zones[zone_type];
>  				if (populated_zone(z)) {
>  					zonelist->zones[pos++] = z;
> -					check_highest_zone(zone_type);
>  				}
>  			}
>  		}
> @@ -1934,6 +1932,7 @@ static void build_zonelists(pg_data_t *p
>  	int local_node, prev_node;
>  	struct zonelist *zonelist;
>  	int order = current_zonelist_order;
> +	int highest_zone;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_NR_ZONES; i++) {
> @@ -1981,6 +1980,32 @@ static void build_zonelists(pg_data_t *p
>  		/* calculate node order -- i.e., DMA last! */
>  		build_zonelists_in_zone_order(pgdat, j);
>  	}
> +	/*
> +	 * Find the lowest zone where mempolicy (MBID) can work well.
> + 	 */

/*
 * Find the lowest zone such that using the MPOL_BIND policy with
 * an arbitrary set of nodes will not go OOM because a suitable
 * zone was unavailable
 */

> +	highest_zone = 0;
> +	policy_zone = -1;
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *first_zone;
> +		int success = 1;
> +		for_each_node_state(node, N_MEMORY) {
> +			first_zone = NODE_DATA(node)->node_zonelists[i][0];
> +			if (zone_idx(first_zone) > highest_zone)
> +				highest_zone = zone_idx(first_zone);
> +			if (first_zone->zone_pgdat != NODE_DATA(node)) {
> +				/* This node cannot offer right pages for this
> +				   GFP */
> +				success = 0;
> +				break;
> +			}

The second "if" needs to go first I believe.

> +		}
> +		if (success) {
> +			policy_zone = i;
> +			break;
> +		}
> +	}
> +	if (policy_zone == -1)
> +		policy_zone = highest_zone;
>  }
>  
>  /* Construct the zonelist performance cache - see further mmzone.h */

-- 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 16:16       ` Mel Gorman
@ 2007-07-26 18:03         ` Christoph Lameter
  2007-07-26 18:26           ` Mel Gorman
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-26 18:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: KAMEZAWA Hiroyuki, linux-mm, Lee Schermerhorn, ak, akpm, pj

On Thu, 26 Jul 2007, Mel Gorman wrote:

> /* policy_zone is the lowest zone index that is present on all nodes */
> 
> Right?

Nope. In a 4 node x86_64 opteron configuration with 8GB memory in 4 2GB 
chunks you could have

node 0	ZONE_DMA, ZONE_DMA32   <2GB
node 1  ZONE_DMA32		<4GB
node 2	ZONE_NORMAL		<6GB
node 3  ZONE_NORMAL		<8GB

So the highest zone gets partitioned off? We only have ZONE_MOVABLE on 
nodes 2 and 3?

There are some other weirdnesses possible with ZONE_MOVABLE on !NUMA.

1GB i386 system

ZONE_DMA
ZONE_NORMAL <900k
ZONE_HIGHEMEM	100k size


ZONE_MOVABLE can then only use 100k?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 13:23     ` Mel Gorman
@ 2007-07-26 18:07       ` Christoph Lameter
  2007-07-26 22:59         ` Mel Gorman
  2007-07-26 18:09       ` Lee Schermerhorn
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-26 18:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Thu, 26 Jul 2007, Mel Gorman wrote:

> > How about changing __alloc_pages to lookup the zonelist on its own based 
> > on a node parameter and a set of allowed nodes? That may significantly 
> > clean up the memory policy layer and the cpuset layer. But it will 
> > increase the effort to scan zonelists on each allocation. A large system 
> > with 1024 nodes may have more than 1024 zones on each nodelist!
> > 
> 
> That sounds like it would require the creation of a zonelist for each
> allocation attempt. That is not ideal as there is no place to allocate
> the zonelist during __alloc_pages(). It's not like it can call
> kmalloc().

Nope it would just require scanning the full zonelists on every alloc as 
you already propose.

> > Nope it would not fail. NUMAQ has policy_zone == HIGHMEM and slab 
> > allocations do not use highmem.
> 
> It would fail if policy_zone didn't exist, that was my point. Without
> policy_zone, we apply policy to all allocations and that causes
> problems.

policy_zone can not exist due to ZONE_DMA32 ZONE_NORMAL issues. See my 
other email.


> I ran the patch on a wide variety of machines, NUMA and non-NUMA. The
> non-NUMA machines showed no differences as you would expect for
> kernbench and aim9. On NUMA machines, I saw both small gains and small
> regressions. By and large, the performance was the same or within 0.08%
> for kernbench which is within noise basically.

Sound okay.

> It might be more pronounced on larger NUMA machines though, I cannot
> generate those figures.

I say lets go with the filtering. That would allow us to also catch other 
issues that are now developing on x86_64 with ZONE_NORMAL and ZONE_DMA32.
 
> I'll try adding a should_filter to zonelist that is only set for
> MPOL_BIND and see what it looks like.

Maybe that is not worth it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 13:23     ` Mel Gorman
  2007-07-26 18:07       ` Christoph Lameter
@ 2007-07-26 18:09       ` Lee Schermerhorn
  1 sibling, 0 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-26 18:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj

On Thu, 2007-07-26 at 14:23 +0100, Mel Gorman wrote:
> On (25/07/07 12:31), Christoph Lameter didst pronounce:
<snip>
> 
> > Lee should probably also review this in detail since he has recent 
> > experience fiddling around with memory policies. Paul has also 
> > experience in this area.
> > 
> 
> Lee had suggested almost the exact same solution but I'd like to hear if
> the implementation matches his expectation.
> 

Mel:

Your patch looks good to me.  I will add it to my test mix shortly.

Meanwhile, I see that Kame-san has posted an "idea patch" that I need to
review....

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 18:03         ` Christoph Lameter
@ 2007-07-26 18:26           ` Mel Gorman
  0 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2007-07-26 18:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KAMEZAWA Hiroyuki, linux-mm, Lee Schermerhorn, ak, akpm, pj

On (26/07/07 11:03), Christoph Lameter didst pronounce:
> On Thu, 26 Jul 2007, Mel Gorman wrote:
> 
> > /* policy_zone is the lowest zone index that is present on all nodes */
> > 
> > Right?
> 
> Nope.

I was talking in the context of Kamezawa's patch.

> In a 4 node x86_64 opteron configuration with 8GB memory in 4 2GB 
> chunks you could have
> 
> node 0	ZONE_DMA, ZONE_DMA32   <2GB
> node 1  ZONE_DMA32		<4GB
> node 2	ZONE_NORMAL		<6GB
> node 3  ZONE_NORMAL		<8GB
> 
> So the highest zone gets partitioned off? We only have ZONE_MOVABLE on 
> nodes 2 and 3?
> 

Yes, that is definitly the case with current behaviour.

> There are some other weirdnesses possible with ZONE_MOVABLE on !NUMA.
> 
> 1GB i386 system
> 
> ZONE_DMA
> ZONE_NORMAL <900k
> ZONE_HIGHEMEM	100k size
> 
> ZONE_MOVABLE can then only use 100k?

Correct.

While it would be possible to have highest zone on each node being used
to make up ZONE_MOVABLE, the required code does not exist but could be
supported. Now that the zone is in mainline, the required effort to support
that situation is worth it but it wasn't worth the development effort earlier.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 18:07       ` Christoph Lameter
@ 2007-07-26 22:59         ` Mel Gorman
  2007-07-27  1:22           ` Christoph Lameter
                             ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Mel Gorman @ 2007-07-26 22:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (26/07/07 11:07), Christoph Lameter didst pronounce:
> On Thu, 26 Jul 2007, Mel Gorman wrote:
> 
> > > How about changing __alloc_pages to lookup the zonelist on its own based 
> > > on a node parameter and a set of allowed nodes? That may significantly 
> > > clean up the memory policy layer and the cpuset layer. But it will 
> > > increase the effort to scan zonelists on each allocation. A large system 
> > > with 1024 nodes may have more than 1024 zones on each nodelist!
> > > 
> > 
> > That sounds like it would require the creation of a zonelist for each
> > allocation attempt. That is not ideal as there is no place to allocate
> > the zonelist during __alloc_pages(). It's not like it can call
> > kmalloc().
> 
> Nope it would just require scanning the full zonelists on every alloc as 
> you already propose.
> 

Right. For this current problem, I would rather not to that. I would rather
fix the bug at hand for 2.6.23 and aim to reduce the number of zonelists in
the next timeframe after a spell in -mm and wider testing. This is to reduce
the risk of introducing performance regressions for a bugfix.

> > > Nope it would not fail. NUMAQ has policy_zone == HIGHMEM and slab 
> > > allocations do not use highmem.
> > 
> > It would fail if policy_zone didn't exist, that was my point. Without
> > policy_zone, we apply policy to all allocations and that causes
> > problems.
> 
> policy_zone can not exist due to ZONE_DMA32 ZONE_NORMAL issues. See my 
> other email.
> 
> 
> > I ran the patch on a wide variety of machines, NUMA and non-NUMA. The
> > non-NUMA machines showed no differences as you would expect for
> > kernbench and aim9. On NUMA machines, I saw both small gains and small
> > regressions. By and large, the performance was the same or within 0.08%
> > for kernbench which is within noise basically.
> 
> Sound okay.
> 
> > It might be more pronounced on larger NUMA machines though, I cannot
> > generate those figures.
> 
> I say lets go with the filtering. That would allow us to also catch other 
> issues that are now developing on x86_64 with ZONE_NORMAL and ZONE_DMA32.
>  
> > I'll try adding a should_filter to zonelist that is only set for
> > MPOL_BIND and see what it looks like.
> 
> Maybe that is not worth it.

This patch filters only when MPOL_BIND is in use. In non-numa, the
checks do not exist and in NUMA cases, the filtering usually does not
take place. I'd like this to be the bug fix for policy + ZONE_MOVABLE
and then deal with reducing zonelists to see if there is any performance
gain as well as a simplification in how policies and cpusets are
implemented.

Testing shows no difference on non-numa as you'd expect and on NUMA machines,
there are very small differences on NUMA (kernbench figures range from -0.02%
to 0.15% differences on machines). Lee, can you test this patch in relation
to MPOL_BIND?  I'll look at the numactl tests tomorrow as well.

Comments?

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e147cf5..5bdd656 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -166,7 +166,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone)
+	if (k > policy_zone && k != ZONE_MOVABLE)
 		policy_zone = k;
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index da8eb8a..eb7cb56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -411,6 +411,24 @@ struct zonelist {
 #endif
 };
 
+#ifdef CONFIG_NUMA
+/*
+ * Only custom zonelists like MPOL_BIND need to be filtered as part of
+ * policies. As described in the comment for struct zonelist_cache, these
+ * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
+ * that to determine if the zonelists needs to be filtered or not.
+ */
+static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
+{
+	return !zonelist->zlcache_ptr;
+}
+#else
+static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
 struct node_active_region {
 	unsigned long start_pfn;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..172abff 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,7 +149,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
 	   lower zones etc. Avoid empty zones because the memory allocator
 	   doesn't like them. If you implement node hot removal you
 	   have to fix that. */
-	k = policy_zone;
+	k = MAX_NR_ZONES - 1;
 	while (1) {
 		for_each_node_mask(nd, *nodes) { 
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40954fb..99c5a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	enum zone_type highest_zoneidx = -1; /* Gets set for policy zonelists */
 
 zonelist_scan:
 	/*
@@ -1166,6 +1167,18 @@ zonelist_scan:
 	z = zonelist->zones;
 
 	do {
+		/*
+		 * In NUMA, this could be a policy zonelist which contains
+		 * zones that may not be allowed by the current gfp_mask.
+		 * Check the zone is allowed by the current flags
+		 */
+		if (unlikely(alloc_should_filter_zonelist(zonelist))) {
+			if (highest_zoneidx == -1)
+				highest_zoneidx = gfp_zone(gfp_mask);
+			if (zone_idx(*z) > highest_zoneidx)
+				continue;
+		}
+
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 22:59         ` Mel Gorman
@ 2007-07-27  1:22           ` Christoph Lameter
  2007-07-27  8:20             ` Mel Gorman
  2007-07-27 14:24           ` Lee Schermerhorn
  2007-08-01 18:59           ` Lee Schermerhorn
  2 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-27  1:22 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Thu, 26 Jul 2007, Mel Gorman wrote:

> Comments?

Lets go with the unconditional filtering and get rid of some of the per 
node zonelists? We could f.e. merge the lists for ZONE_MOVABLE and 
ZONE_base_of_zone_movable? That may increase the cacheability of the 
zonelists and reduce cache footprint.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-27  1:22           ` Christoph Lameter
@ 2007-07-27  8:20             ` Mel Gorman
  2007-07-27 15:45               ` Mel Gorman
  0 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-07-27  8:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (26/07/07 18:22), Christoph Lameter didst pronounce:
> On Thu, 26 Jul 2007, Mel Gorman wrote:
> 
> > Comments?
> 
> Lets go with the unconditional filtering and get rid of some of the per 
> node zonelists?

I would prefer to go with this for 2.6.23 and work on that for 2.6.24.
The patch should be relatively straight-forward (I'll work on it today)
but it would need wider testing than what I can do here, particularly on
the larger machines that needed things like zlcache.

> We could f.e. merge the lists for ZONE_MOVABLE and 
> ZONE_base_of_zone_movable?

That will be fine for freelist management but a mess with respect to
reclaim. I'd rather not go down that rathole.

> That may increase the cacheability of the 
> zonelists and reduce cache footprint.

That should be the case. I'll work on the patch today and see what sort
of results I get.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 22:59         ` Mel Gorman
  2007-07-27  1:22           ` Christoph Lameter
@ 2007-07-27 14:24           ` Lee Schermerhorn
  2007-08-01 18:59           ` Lee Schermerhorn
  2 siblings, 0 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 14:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Eric Whitney

On Thu, 2007-07-26 at 23:59 +0100, Mel Gorman wrote:
> On (26/07/07 11:07), Christoph Lameter didst pronounce:
> > On Thu, 26 Jul 2007, Mel Gorman wrote:
> > 
> > > > How about changing __alloc_pages to lookup the zonelist on its own based 
> > > > on a node parameter and a set of allowed nodes? That may significantly 
> > > > clean up the memory policy layer and the cpuset layer. But it will 
> > > > increase the effort to scan zonelists on each allocation. A large system 
> > > > with 1024 nodes may have more than 1024 zones on each nodelist!
> > > > 
> > > 
> > > That sounds like it would require the creation of a zonelist for each
> > > allocation attempt. That is not ideal as there is no place to allocate
> > > the zonelist during __alloc_pages(). It's not like it can call
> > > kmalloc().
> > 
> > Nope it would just require scanning the full zonelists on every alloc as 
> > you already propose.
> > 
> 
> Right. For this current problem, I would rather not to that. I would rather
> fix the bug at hand for 2.6.23 and aim to reduce the number of zonelists in
> the next timeframe after a spell in -mm and wider testing. This is to reduce
> the risk of introducing performance regressions for a bugfix.
> 
> > > > Nope it would not fail. NUMAQ has policy_zone == HIGHMEM and slab 
> > > > allocations do not use highmem.
> > > 
> > > It would fail if policy_zone didn't exist, that was my point. Without
> > > policy_zone, we apply policy to all allocations and that causes
> > > problems.
> > 
> > policy_zone can not exist due to ZONE_DMA32 ZONE_NORMAL issues. See my 
> > other email.
> > 
> > 
> > > I ran the patch on a wide variety of machines, NUMA and non-NUMA. The
> > > non-NUMA machines showed no differences as you would expect for
> > > kernbench and aim9. On NUMA machines, I saw both small gains and small
> > > regressions. By and large, the performance was the same or within 0.08%
> > > for kernbench which is within noise basically.
> > 
> > Sound okay.
> > 
> > > It might be more pronounced on larger NUMA machines though, I cannot
> > > generate those figures.
> > 
> > I say lets go with the filtering. That would allow us to also catch other 
> > issues that are now developing on x86_64 with ZONE_NORMAL and ZONE_DMA32.
> >  
> > > I'll try adding a should_filter to zonelist that is only set for
> > > MPOL_BIND and see what it looks like.
> > 
> > Maybe that is not worth it.
> 
> This patch filters only when MPOL_BIND is in use. In non-numa, the
> checks do not exist and in NUMA cases, the filtering usually does not
> take place. I'd like this to be the bug fix for policy + ZONE_MOVABLE
> and then deal with reducing zonelists to see if there is any performance
> gain as well as a simplification in how policies and cpusets are
> implemented.
> 
> Testing shows no difference on non-numa as you'd expect and on NUMA machines,
> there are very small differences on NUMA (kernbench figures range from -0.02%
> to 0.15% differences on machines). Lee, can you test this patch in relation
> to MPOL_BIND?  I'll look at the numactl tests tomorrow as well.
> 
> Comments?
> 
<snip>

Mel,

I'll queue this up.  Not sure I'll get to it before the weekend, tho'.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-27  8:20             ` Mel Gorman
@ 2007-07-27 15:45               ` Mel Gorman
  2007-07-27 17:35                 ` Christoph Lameter
  2007-07-28  7:28                 ` NUMA policy issues with ZONE_MOVABLE KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 60+ messages in thread
From: Mel Gorman @ 2007-07-27 15:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (27/07/07 09:20), Mel Gorman didst pronounce:
> On (26/07/07 18:22), Christoph Lameter didst pronounce:
> > On Thu, 26 Jul 2007, Mel Gorman wrote:
> > 
> > > Comments?
> > 
> > Lets go with the unconditional filtering and get rid of some of the per 
> > node zonelists?
> 
> I would prefer to go with this for 2.6.23 and work on that for 2.6.24.
> The patch should be relatively straight-forward (I'll work on it today)
> but it would need wider testing than what I can do here, particularly on
> the larger machines that needed things like zlcache.
> 
> > We could f.e. merge the lists for ZONE_MOVABLE and 
> > ZONE_base_of_zone_movable?
> 
> That will be fine for freelist management but a mess with respect to
> reclaim. I'd rather not go down that rathole.
> 
> > That may increase the cacheability of the 
> > zonelists and reduce cache footprint.
> 
> That should be the case. I'll work on the patch today and see what sort
> of results I get.
> 

This was fairly straight-forward but I wouldn't call it a bug fix for 2.6.23
for the policys + ZONE_MOVABLE issue; I still prefer the last patch for
the fix.

This patch uses one zonelist per node and filters based on a gfp_mask where
necessary. It consumes less memory and reduces cache pressure at the cost
of CPU. It also adds a zone_id field to struct zone as zone_idx is used more
than it was previously.

Performance differences on kernbench for Total CPU time ranged from
-0.06% to +1.19%.

Obvious things that are outstanding;

o Compile-test parisc
o Split patch in two to keep the zone_idx changes separetly
o Verify zlccache is not broken
o Have a version of __alloc_pages take a nodemask and ditch
  bind_zonelist()

I can work on bringing this up to scratch during the cycle.

Patch as follows. Comments?

--- 
 arch/parisc/mm/init.c     |    7 ++
 drivers/char/sysrq.c      |    2 
 fs/buffer.c               |    2 
 include/linux/gfp.h       |   10 +++-
 include/linux/mempolicy.h |    4 -
 include/linux/mmzone.h    |    7 +-
 mm/mempolicy.c            |    8 +--
 mm/oom_kill.c             |    7 ++
 mm/page_alloc.c           |  112 ++++++++++++++++++++++------------------------
 mm/slab.c                 |    7 ++
 mm/slub.c                 |    7 ++
 mm/vmscan.c               |    6 ++
 12 files changed, 101 insertions(+), 78 deletions(-)

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index e724b36..4d417c4 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -602,12 +602,15 @@ void show_mem(void)
 		int i, j, k;
 
 		for (i = 0; i < npmem_ranges; i++) {
+			zl = &NODE_DATA(i)->node_zonelist;
 			for (j = 0; j < MAX_NR_ZONES; j++) {
-				zl = NODE_DATA(i)->node_zonelists + j;
 
 				printk("Zone list for zone %d on node %d: ", j, i);
-				for (k = 0; zl->zones[k] != NULL; k++) 
+				for (k = 0; zl->zones[k] != NULL; k++)  {
+					if (should_filter_zone(zl->zones[k]), j)
+						continue;
 					printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+				}
 				printk("\n");
 			}
 		}
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index 39cc318..b56d17f 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -270,7 +270,7 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
+	out_of_memory(&NODE_DATA(0)->node_zonelist,
 			GFP_KERNEL, 0);
 }
 
diff --git a/fs/buffer.c b/fs/buffer.c
index 0e5ec37..8e9bbef 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -354,7 +354,7 @@ static void free_more_memory(void)
 	yield();
 
 	for_each_online_pgdat(pgdat) {
-		zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
+		zones = pgdat->node_zonelist.zones;
 		if (*zones)
 			try_to_free_pages(zones, 0, GFP_NOFS);
 	}
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bc68dd9..f2a597e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -116,6 +116,13 @@ static inline enum zone_type gfp_zone(gfp_t flags)
 	return ZONE_NORMAL;
 }
 
+static inline int should_filter_zone(struct zone *zone, int highest_zoneidx)
+{
+	if (zone_idx(zone) > highest_zoneidx)
+		return 1;
+	return 0;
+}
+
 /*
  * There is only one page-allocator function, and two main namespaces to
  * it. The alloc_page*() variants return 'struct page *' and as such
@@ -151,8 +158,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	if (nid < 0)
 		nid = numa_node_id();
 
-	return __alloc_pages(gfp_mask, order,
-		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+	return __alloc_pages(gfp_mask, order, &NODE_DATA(nid)->node_zonelist);
 }
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e147cf5..83e5256 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -166,7 +166,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone)
+	if (k > policy_zone && k != ZONE_MOVABLE)
 		policy_zone = k;
 }
 
@@ -258,7 +258,7 @@ static inline void mpol_fix_fork_child_flag(struct task_struct *p)
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 		unsigned long addr, gfp_t gfp_flags)
 {
-	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+	return &NODE_DATA(0)->node_zonelist;
 }
 
 static inline int do_migrate_pages(struct mm_struct *mm,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index da8eb8a..7a0533e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -202,6 +202,7 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
+	int zone_idx;
 #ifdef CONFIG_NUMA
 	int node;
 	/*
@@ -438,7 +439,7 @@ extern struct page *mem_map;
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
-	struct zonelist node_zonelists[MAX_NR_ZONES];
+	struct zonelist node_zonelist;
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 	struct page *node_mem_map;
@@ -502,7 +503,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+#define zone_idx(zone)		((zone)->zone_idx)
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -544,7 +545,7 @@ static inline int is_normal_idx(enum zone_type idx)
 static inline int is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_idx = zone - zone->zone_pgdat->node_zones;
+	int zone_idx = zone_idx(zone);
 	return zone_idx == ZONE_HIGHMEM ||
 		(zone_idx == ZONE_MOVABLE && zone_movable_is_highmem());
 #else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..8b16ca3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,7 +149,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
 	   lower zones etc. Avoid empty zones because the memory allocator
 	   doesn't like them. If you implement node hot removal you
 	   have to fix that. */
-	k = policy_zone;
+	k = MAX_NR_ZONES - 1;
 	while (1) {
 		for_each_node_mask(nd, *nodes) { 
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];
@@ -1116,7 +1116,7 @@ static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 		nd = 0;
 		BUG();
 	}
-	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+	return &NODE_DATA(nd)->node_zonelist;
 }
 
 /* Do dynamic interleaving for a process */
@@ -1212,7 +1212,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
-		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+		return &NODE_DATA(nid)->node_zonelist;
 	}
 	return zonelist_policy(GFP_HIGHUSER, pol);
 }
@@ -1226,7 +1226,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 	struct zonelist *zl;
 	struct page *page;
 
-	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+	zl = &NODE_DATA(nid)->node_zonelist;
 	page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zl->zones[0])
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a700141..8b36019 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -178,6 +178,7 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
 	struct zone **z;
 	nodemask_t nodes;
 	int node;
+	enum zone_type highest_zoneidx = gfp_zone(gfp_mask);
 
 	nodes_clear(nodes);
 	/* node has memory ? */
@@ -185,11 +186,15 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
 		if (NODE_DATA(node)->node_present_pages)
 			node_set(node, nodes);
 
-	for (z = zonelist->zones; *z; z++)
+	for (z = zonelist->zones; *z; z++) {
+
+		if (should_filter_zone(*z, highest_zoneidx))
+			continue;
 		if (cpuset_zone_allowed_softwall(*z, gfp_mask))
 			node_clear(zone_to_nid(*z), nodes);
 		else
 			return CONSTRAINT_CPUSET;
+	}
 
 	if (!nodes_empty(nodes))
 		return CONSTRAINT_MEMORY_POLICY;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40954fb..3ad57af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	enum zone_type highest_zoneidx = gfp_zone(gfp_mask);
 
 zonelist_scan:
 	/*
@@ -1166,6 +1167,9 @@ zonelist_scan:
 	z = zonelist->zones;
 
 	do {
+		if (should_filter_zone(*z, highest_zoneidx))
+			continue;
+
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1456,11 +1460,11 @@ static unsigned int nr_free_zone_pages(int offset)
 	pg_data_t *pgdat = NODE_DATA(numa_node_id());
 	unsigned int sum = 0;
 
-	struct zonelist *zonelist = pgdat->node_zonelists + offset;
-	struct zone **zonep = zonelist->zones;
-	struct zone *zone;
+	struct zone **zonep = pgdat->node_zonelist.zones;
+	struct zone *zone = *zonep;
 
-	for (zone = *zonep++; zone; zone = *zonep++) {
+	for (zone = *zonep++; zone && zone_idx(zone) > offset; zone = *zonep++);
+	for (; zone; zone = *zonep++) {
 		unsigned long size = zone->present_pages;
 		unsigned long high = zone->pages_high;
 		if (size > high)
@@ -1819,17 +1823,14 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
  */
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	enum zone_type i;
 	int j;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		for (j = 0; zonelist->zones[j] != NULL; j++)
-			;
- 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		zonelist->zones[j] = NULL;
-	}
+	zonelist = &pgdat->node_zonelist;
+	for (j = 0; zonelist->zones[j] != NULL; j++)
+		;
+ 	j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1);
+	zonelist->zones[j] = NULL;
 }
 
 /*
@@ -1842,27 +1843,24 @@ static int node_order[MAX_NUMNODES];
 
 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
-	enum zone_type i;
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
 	struct zone *z;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		pos = 0;
-		for (zone_type = i; zone_type >= 0; zone_type--) {
-			for (j = 0; j < nr_nodes; j++) {
-				node = node_order[j];
-				z = &NODE_DATA(node)->node_zones[zone_type];
-				if (populated_zone(z)) {
-					zonelist->zones[pos++] = z;
-					check_highest_zone(zone_type);
-				}
+	zonelist = &pgdat->node_zonelist;
+	pos = 0;
+	for (zone_type = MAX_NR_ZONES-1; zone_type >= 0; zone_type--) {
+		for (j = 0; j < nr_nodes; j++) {
+			node = node_order[j];
+			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				zonelist->zones[pos++] = z;
+				check_highest_zone(zone_type);
 			}
 		}
-		zonelist->zones[pos] = NULL;
 	}
+	zonelist->zones[pos] = NULL;
 }
 
 static int default_zonelist_order(void)
@@ -1929,17 +1927,14 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int j, node, load;
-	enum zone_type i;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
 	int order = current_zonelist_order;
 
-	/* initialize zonelists */
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zones[0] = NULL;
-	}
+	/* initialize zonelist */
+	zonelist = &pgdat->node_zonelist;
+	zonelist->zones[0] = NULL;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -1993,7 +1988,7 @@ static void build_zonelist_cache(pg_data_t *pgdat)
 		struct zonelist_cache *zlc;
 		struct zone **z;
 
-		zonelist = pgdat->node_zonelists + i;
+		zonelist = &pgdat->node_zonelist;
 		zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
 		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
 		for (z = zonelist->zones; *z; z++)
@@ -2012,36 +2007,36 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type i,j;
+	enum zone_type j;
+	struct zonelist *zonelist;
 
 	local_node = pgdat->node_id;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
 
-		zonelist = pgdat->node_zonelists + i;
-
- 		j = build_zonelists_node(pgdat, zonelist, 0, i);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
-		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
-		for (node = 0; node < local_node; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
+	zonelist = &pgdat->node_zonelist;
+	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES-1);
 
-		zonelist->zones[j] = NULL;
+ 	/*
+	 * Now we build the zonelist so that it contains the zones
+	 * of all the other nodes.
+	 * We don't want to pressure a particular node, so when
+	 * building the zones for node N, we make sure that the
+	 * zones coming right after the local ones are those from
+	 * node N+1 (modulo N)
+	 */
+	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+								MAX_NR_ZONES-1);
 	}
+	for (node = 0; node < local_node; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+								MAX_NR_ZONES-1);
+	}
+
+	zonelist->zones[j] = NULL;
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
@@ -2050,7 +2045,7 @@ static void build_zonelist_cache(pg_data_t *pgdat)
 	int i;
 
 	for (i = 0; i < MAX_NR_ZONES; i++)
-		pgdat->node_zonelists[i].zlcache_ptr = NULL;
+		pgdat->node_zonelist.zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */
@@ -2936,6 +2931,7 @@ static void __meminit free_area_init_core(struct pglist_data *pgdat,
 			nr_kernel_pages += realsize;
 		nr_all_pages += realsize;
 
+		zone->zone_idx = j;
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
 #ifdef CONFIG_NUMA
diff --git a/mm/slab.c b/mm/slab.c
index bde271c..d73fe30 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3216,12 +3216,12 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	struct zone **z;
 	void *obj = NULL;
 	int nid;
+	enum zone_type highest_zoneidx = gfp_zone(flags);
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-			->node_zonelists[gfp_zone(flags)];
+	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
 	local_flags = (flags & GFP_LEVEL_MASK);
 
 retry:
@@ -3230,6 +3230,9 @@ retry:
 	 * from existing per node queues.
 	 */
 	for (z = zonelist->zones; *z && !obj; z++) {
+		if (should_filter_zone(*z, highest_zoneidx))
+			continue;
+
 		nid = zone_to_nid(*z);
 
 		if (cpuset_zone_allowed_hardwall(*z, flags) &&
diff --git a/mm/slub.c b/mm/slub.c
index 9b2d617..a020a12 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1276,6 +1276,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 	struct zonelist *zonelist;
 	struct zone **z;
 	struct page *page;
+	enum zone_type highest_zoneidx = gfp_zone(flags);
 
 	/*
 	 * The defrag ratio allows a configuration of the tradeoffs between
@@ -1298,11 +1299,13 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-					->node_zonelists[gfp_zone(flags)];
+	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
 	for (z = zonelist->zones; *z; z++) {
 		struct kmem_cache_node *n;
 
+		if (should_filter_zone(*z, highest_zoneidx))
+			continue;
+
 		n = get_node(s, zone_to_nid(*z));
 
 		if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d419e10..8672d61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1124,6 +1124,7 @@ unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	enum zone_type highest_zoneidx;
 	int i;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1136,9 +1137,14 @@ unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 
 	count_vm_event(ALLOCSTALL);
 
+	highest_zoneidx = gfp_zone(gfp_mask);
+
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
 
+		if (should_filter_zone(zone, highest_zoneidx))
+			continue;
+
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-27 15:45               ` Mel Gorman
@ 2007-07-27 17:35                 ` Christoph Lameter
  2007-07-27 17:46                   ` Mel Gorman
  2007-07-27 18:00                   ` [PATCH] Document Linux Memory Policy - V2 Lee Schermerhorn
  2007-07-28  7:28                 ` NUMA policy issues with ZONE_MOVABLE KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 60+ messages in thread
From: Christoph Lameter @ 2007-07-27 17:35 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Fri, 27 Jul 2007, Mel Gorman wrote:

> This was fairly straight-forward but I wouldn't call it a bug fix for 2.6.23
> for the policys + ZONE_MOVABLE issue; I still prefer the last patch for
> the fix.
> 
> This patch uses one zonelist per node and filters based on a gfp_mask where
> necessary. It consumes less memory and reduces cache pressure at the cost
> of CPU. It also adds a zone_id field to struct zone as zone_idx is used more
> than it was previously.
> 
> Performance differences on kernbench for Total CPU time ranged from
> -0.06% to +1.19%.

Performance is equal otherwise?
 
> Obvious things that are outstanding;
> 
> o Compile-test parisc
> o Split patch in two to keep the zone_idx changes separetly
> o Verify zlccache is not broken
> o Have a version of __alloc_pages take a nodemask and ditch
>   bind_zonelist()

Yeah. I think the NUMA folks would love this but the rest of the 
developers may object.

> I can work on bringing this up to scratch during the cycle.
> 
> Patch as follows. Comments?

Glad to see some movement in this area. 

> index bc68dd9..f2a597e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -116,6 +116,13 @@ static inline enum zone_type gfp_zone(gfp_t flags)
>  	return ZONE_NORMAL;
>  }
>  
> +static inline int should_filter_zone(struct zone *zone, int highest_zoneidx)
> +{
> +	if (zone_idx(zone) > highest_zoneidx)
> +		return 1;
> +	return 0;
> +}
> +

I think this should_filter() creates more overhead than which it saves. In 
particular true for configurations with a small number of zones like SMP 
systems. For large NUMA systems the cache savings will likely may it 
beneficial.

Simply filter all.

> @@ -258,7 +258,7 @@ static inline void mpol_fix_fork_child_flag(struct task_struct *p)
>  static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  		unsigned long addr, gfp_t gfp_flags)
>  {
> -	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
> +	return &NODE_DATA(0)->node_zonelist;
>  }

These modifications look good in terrms of code size reduction.

> @@ -438,7 +439,7 @@ extern struct page *mem_map;
>  struct bootmem_data;
>  typedef struct pglist_data {
>  	struct zone node_zones[MAX_NR_ZONES];
> -	struct zonelist node_zonelists[MAX_NR_ZONES];
> +	struct zonelist node_zonelist;

Looks like a significant memory savings on 1024 node numa. zonelist has
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
zones.

> @@ -185,11 +186,15 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
>  		if (NODE_DATA(node)->node_present_pages)
>  			node_set(node, nodes);
>  
> -	for (z = zonelist->zones; *z; z++)
> +	for (z = zonelist->zones; *z; z++) {
> +
> +		if (should_filter_zone(*z, highest_zoneidx))
> +			continue;

Huh? Why do you need it here? Note that this code is also going away with 
the memoryless node patch. We can use the nodes with memory nodemask here.

> diff --git a/mm/slub.c b/mm/slub.c
> index 9b2d617..a020a12 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1276,6 +1276,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
>  	struct zonelist *zonelist;
>  	struct zone **z;
>  	struct page *page;
> +	enum zone_type highest_zoneidx = gfp_zone(flags);
>  
>  	/*
>  	 * The defrag ratio allows a configuration of the tradeoffs between
> @@ -1298,11 +1299,13 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
>  	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
>  		return NULL;
>  
> -	zonelist = &NODE_DATA(slab_node(current->mempolicy))
> -					->node_zonelists[gfp_zone(flags)];
> +	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
>  	for (z = zonelist->zones; *z; z++) {
>  		struct kmem_cache_node *n;
>  
> +		if (should_filter_zone(*z, highest_zoneidx))
> +			continue;
> +
>  		n = get_node(s, zone_to_nid(*z));
>  
>  		if (n && cpuset_zone_allowed_hardwall(*z, flags) &&

Isnt there some way to fold these traversals into a common page allocator 
function?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-27 17:35                 ` Christoph Lameter
@ 2007-07-27 17:46                   ` Mel Gorman
  2007-07-27 18:38                     ` Christoph Lameter
  2007-07-27 18:00                   ` [PATCH] Document Linux Memory Policy - V2 Lee Schermerhorn
  1 sibling, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-07-27 17:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (27/07/07 10:35), Christoph Lameter didst pronounce:
> On Fri, 27 Jul 2007, Mel Gorman wrote:
> 
> > This was fairly straight-forward but I wouldn't call it a bug fix for 2.6.23
> > for the policys + ZONE_MOVABLE issue; I still prefer the last patch for
> > the fix.
> > 
> > This patch uses one zonelist per node and filters based on a gfp_mask where
> > necessary. It consumes less memory and reduces cache pressure at the cost
> > of CPU. It also adds a zone_id field to struct zone as zone_idx is used more
> > than it was previously.
> > 
> > Performance differences on kernbench for Total CPU time ranged from
> > -0.06% to +1.19%.
> 
> Performance is equal otherwise?
>  

Initial tests imply yes but I haven't done broader tests yet. It saves 64
bytes on the size of the node structure on a non-numa i386 machine so even
that might be noticable in some cases.

> > Obvious things that are outstanding;
> > 
> > o Compile-test parisc
> > o Split patch in two to keep the zone_idx changes separetly
> > o Verify zlccache is not broken
> > o Have a version of __alloc_pages take a nodemask and ditch
> >   bind_zonelist()
> 
> Yeah. I think the NUMA folks would love this but the rest of the 
> developers may object.
> 
> > I can work on bringing this up to scratch during the cycle.
> > 
> > Patch as follows. Comments?
> 
> Glad to see some movement in this area. 
> 
> > index bc68dd9..f2a597e 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -116,6 +116,13 @@ static inline enum zone_type gfp_zone(gfp_t flags)
> >  	return ZONE_NORMAL;
> >  }
> >  
> > +static inline int should_filter_zone(struct zone *zone, int highest_zoneidx)
> > +{
> > +	if (zone_idx(zone) > highest_zoneidx)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> 
> I think this should_filter() creates more overhead than which it saves.

It's why part of the patch adds a zone_idx field to struct zone instead
of mucking around with pgdat->node_zones.

> In 
> particular true for configurations with a small number of zones like SMP 
> systems. For large NUMA systems the cache savings will likely may it 
> beneficial.
> 
> Simply filter all.
> 

What do you mean by simply filter all? The should_filter_zone() is
returning 1 if the zone should not be used for the current gfp_mask. It
would be easier to read (but slower) if it was expressed as

if (zone_idx(zone) > gfp_zone(gfp_mask))
	return 1;

so that zones unsuitable for gfp_mask are ignored.

> > @@ -258,7 +258,7 @@ static inline void mpol_fix_fork_child_flag(struct task_struct *p)
> >  static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> >  		unsigned long addr, gfp_t gfp_flags)
> >  {
> > -	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
> > +	return &NODE_DATA(0)->node_zonelist;
> >  }
> 
> These modifications look good in terrms of code size reduction.
> 

720 bytes less in the size of the text section for a standalone non-numa
machine.

> > @@ -438,7 +439,7 @@ extern struct page *mem_map;
> >  struct bootmem_data;
> >  typedef struct pglist_data {
> >  	struct zone node_zones[MAX_NR_ZONES];
> > -	struct zonelist node_zonelists[MAX_NR_ZONES];
> > +	struct zonelist node_zonelist;
> 
> Looks like a significant memory savings on 1024 node numa. zonelist has
> #define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
> zones.
> 

I'll gather figures.

> > @@ -185,11 +186,15 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
> >  		if (NODE_DATA(node)->node_present_pages)
> >  			node_set(node, nodes);
> >  
> > -	for (z = zonelist->zones; *z; z++)
> > +	for (z = zonelist->zones; *z; z++) {
> > +
> > +		if (should_filter_zone(*z, highest_zoneidx))
> > +			continue;
> 
> Huh? Why do you need it here? Note that this code is also going away with 
> the memoryless node patch. We can use the nodes with memory nodemask here.
> 

This function expects to walk a zonelist suitable for the gfp_mask. As
the zonelists it gets has potentially unsuitable zones in it, it must be
filtered as well so that it is functionally identical.

> > diff --git a/mm/slub.c b/mm/slub.c
> > index 9b2d617..a020a12 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1276,6 +1276,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
> >  	struct zonelist *zonelist;
> >  	struct zone **z;
> >  	struct page *page;
> > +	enum zone_type highest_zoneidx = gfp_zone(flags);
> >  
> >  	/*
> >  	 * The defrag ratio allows a configuration of the tradeoffs between
> > @@ -1298,11 +1299,13 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
> >  	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
> >  		return NULL;
> >  
> > -	zonelist = &NODE_DATA(slab_node(current->mempolicy))
> > -					->node_zonelists[gfp_zone(flags)];
> > +	zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
> >  	for (z = zonelist->zones; *z; z++) {
> >  		struct kmem_cache_node *n;
> >  
> > +		if (should_filter_zone(*z, highest_zoneidx))
> > +			continue;
> > +
> >  		n = get_node(s, zone_to_nid(*z));
> >  
> >  		if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
> 
> Isnt there some way to fold these traversals into a common page allocator 
> function?

Probably. When I looked first, each of the users were traversing the zonelist
slightly differently so it wasn't obvious how to have a single iterator but
it's a point for improvement.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH] Document Linux Memory Policy - V2
  2007-07-27 17:35                 ` Christoph Lameter
  2007-07-27 17:46                   ` Mel Gorman
@ 2007-07-27 18:00                   ` Lee Schermerhorn
  2007-07-27 18:38                     ` Randy Dunlap
                                       ` (2 more replies)
  1 sibling, 3 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 18:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Christoph Lameter, ak, Mel Gorman, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

Here's a second attempt to document the existing Linux Memory Policy.
I've tried to address comments on the first cut from Christoph and Andi.
I've removed the details of the APIs and referenced the man pages "for
more details".  I've made a stab at addressing the interaction with
cpusets, but more could be done here.

I'm hoping we can get this merged in some form, and then update it with
all of the policy changes that are in the queue and/or being
worked--memoryless nodes, interaction with ZONE_MOVABLE, ... .

Lee

----------------

[PATCH] Document Linux Memory Policy - V2

I couldn't find any memory policy documentation in the Documentation
directory, so here is my attempt to document it.

There's lots more that could be written about the internal design--including
data structures, functions, etc.  However, if you agree that this is better
that the nothing that exists now, perhaps it could be merged.  This will
provide a baseline for updates to document the many policy patches that are
currently being worked.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)

Index: Linux/Documentation/vm/memory_policy.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
@@ -0,0 +1,278 @@
+
+What is Linux Memory Policy?
+
+In the Linux kernel, "memory policy" determines from which node the kernel will
+allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
+supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
+The current memory policy support was added to Linux 2.6 around May 2004.  This
+document attempts to describe the concepts and APIs of the 2.6 memory policy
+support.
+
+See also Documentation/cpusets.txt which describes a higher level,
+administrative mechanism for restricting the set of nodes from which memory
+policy may allocate pages.  Also, see "MEMORY POLICIES AND CPUSETS" below.
+
+MEMORY POLICY CONCEPTS
+
+Scope of Memory Policies
+
+The Linux kernel supports four more or less distinct scopes of memory policy:
+
+    System Default Policy:  this policy is "hard coded" into the kernel.  It
+    is the policy that governs the all page allocations that aren't controlled
+    by one of the more specific policy scopes discussed below.
+
+    Task/Process Policy:  this is an optional, per-task policy.  When defined
+    for a specific task, this policy controls all page allocations made by or
+    on behalf of the task that aren't controlled by a more specific scope.
+    If a task does not define a task policy, then all page allocations that
+    would have been controlled by the task policy "fall back" to the System
+    Default Policy.
+
+	Because task policy applies to the entire address space of a task,
+	it is inheritable across both fork() [clone() w/o the CLONE_VM flag]
+	and exec*().  Thus, a parent task may establish the task policy for
+	a child task exec()'d from an executable image that has no awareness
+	of memory policy.
+
+	In a multi-threaded task, task policies apply only to the thread
+	[Linux kernel task] that installs the policy and any threads
+	subsequently created by that thread.  Any sibling threads existing
+	at the time a new task policy is installed retain their current
+	policy.
+
+	A task policy applies only to pages allocated after the policy is
+	installed.  Any pages already faulted in by the task remain where
+	they were allocated based on the policy at the time they were
+	allocated.
+
+    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
+    virtual adddress space.  A task may define a specific policy for a range
+    of its virtual address space.  This VMA policy will govern the allocation
+    of pages that back this region of the address space.  Any regions of the
+    task's address space that don't have an explicit VMA policy will fall back
+    to the task policy, which may itself fall back to the system default policy.
+
+	VMA policy applies ONLY to anonymous pages.  These include pages
+	allocated for anonymous segments, such as the task stack and heap, and
+	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
+	Anonymous pages copied from private file mappings [files mmap()ed with
+	the MAP_PRIVATE flag] also obey VMA policy, if defined.
+
+	VMA policies are shared between all tasks that share a virtual address
+	space--a.k.a. threads--independent of when the policy is installed; and
+	they are inherited across fork().  However, because VMA policies refer
+	to a specific region of a task's address space, and because the address
+	space is discarded and recreated on exec*(), VMA policies are NOT
+	inheritable across exec().  Thus, only NUMA-aware applications may
+	use VMA policies.
+
+	A task may install a new VMA policy on a sub-range of a previously
+	mmap()ed region.  When this happens, Linux splits the existing virtual
+	memory area into 2 or 3 VMAs, each with it's own policy.
+
+	By default, VMA policy applies only to pages allocated after the policy
+	is installed.  Any pages already faulted into the VMA range remain where
+	they were allocated based on the policy at the time they were
+	allocated.  However, since 2.6.16, Linux supports page migration so
+	that page contents can be moved to match a newly installed policy.
+
+    Shared Policy:  This policy applies to "memory objects" mapped shared into
+    one or more tasks' distinct address spaces.  Shared policies are applied
+    directly to the shared object.  Thus, all tasks that attach to the object
+    share the policy, and all pages allocated for the shared object, by any
+    task, will obey the shared policy.
+
+	Currently [2.6.22], only shared memory segments, created by shmget(),
+	support shared policy.  When shared policy support was added to Linux,
+	the associated data structures were added to shared hugetlbfs segments.
+	However, at the time, hugetlbfs did not support allocation at fault
+	time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
+	up" to the shared policy support.  Although hugetlbfs segments now
+	support lazy allocation, their support for shared policy has not been
+	completed.
+
+	Although internal to the kernel shared memory segments are really
+	files backed by swap space that have been mmap()ed shared into tasks'
+	address spaces, regular files mmap()ed shared do NOT support shared
+	policy.  Rather, shared page cache pages, including pages backing
+	private mappings that have not yet been written by the task, follow
+	task policy, if any, else system default policy.
+
+	The shared policy infrastructure supports different policies on subset
+	ranges of the shared object.  However, Linux still splits the VMA of
+	the task that installs the policy for each range of distinct policy.
+	Thus, different tasks that attach to a shared memory segment can have
+	different VMA configurations mapping that one shared object.
+
+Components of Memory Policies
+
+    A Linux memory policy is a tuple consisting of a "mode" and an optional set
+    of nodes.  The mode determine the behavior of the policy, while the optional
+    set of nodes can be viewed as the arguments to the behavior.
+
+   Internally, memory policies are implemented by a reference counted structure,
+   struct mempolicy.  Details of this structure will be discussed in context,
+   below.
+
+	Note:  in some functions AND in the struct mempolicy, the mode is
+	called "policy".  However, to avoid confusion with the policy tuple,
+	this document will continue to use the term "mode".
+
+   Linux memory policy supports the following 4 modes:
+
+	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
+	context dependent.
+
+	    During normal system operation, the system default policy is hard
+	    coded to contain the Default mode.  During system boot up, the
+	    system default policy is temporarily set to MPOL_INTERLEAVE [see
+	    below] to distribute boot time allocations across all nodes in
+	    the system, instead of using just the node containing the boot cpu.
+
+	    In this context, default mode means "local" allocation--that is
+	    attempt to allocate the page from the node associated with the cpu
+	    where the fault occurs.  If the "local" node has no memory, or the
+	    node's memory can be exhausted [no free pages available], local
+	    allocation will attempt to allocate pages from "nearby" nodes, using
+	    a per node list of nodes--called zonelists--built at boot time, or
+	    when nodes or memory are added or removed from the system [memory
+	    hotplug].
+
+	    When a task/process policy or a shared policy contains the Default
+	    mode, this also means local allocation, as described above.
+
+	    In the context of a VMA, Default mode means "fall back to task
+	    policy"--which may or may not specify Default mode.  Thus, Default
+	    mode can not be counted on to mean local allocation when used
+	    on a non-shared region of the address space.  However, see
+	    MPOL_PREFERRED below.
+
+	    The Default mode does not use the optional set of nodes.
+
+	MPOL_BIND:  This mode specifies that memory must come from the
+	set of nodes specified by the policy.  The kernel builds a custom
+	zonelist pointed to by the zonelist member of struct mempolicy,
+	containing just the nodes specified by the Bind policy.  If the kernel
+	is unable to allocate a page from the first node in the custom zonelist,
+	it moves on to the next, and so forth.  If it is unable to allocate a
+	page from any of the nodes in this list, the allocation will fail.
+
+	    The memory policy APIs do not specify an order in which the nodes
+	    will be searched.  However, unlike the per node zonelists mentioned
+	    above, the custom zonelist for the Bind policy do not consider the
+	    distance between the nodes.  Rather, the lists are built in order
+	    of numeric node id.
+
+	MPOL_PREFERRED:  This mode specifies that the allocation should be
+	attempted from the single node specified in the policy.  If that
+	allocation fails, the kernel will search other nodes, exactly as
+	it would for a local allocation that started at the preferred node--
+	that is, using the per-node zonelists in increasing distance from
+	the preferred node.
+
+	    Internally, the Preferred policy uses a single node--the
+	    preferred_node member of struct mempolicy.
+
+	    If the Preferred policy node is '-1', then at page allocation time,
+	    the kernel will use the "local node" as the starting point for the
+	    allocation.  This is the way to specify local allocation for a
+	    specific range of addresses--i.e. for VMA policies.
+
+	MPOL_INTERLEAVED:  This mode specifies that page allocations be
+	interleaved, on a page granularity, across the nodes specified in
+	the policy.  This mode also behaves slightly differently, based on
+	the context where it is used:
+
+	    For allocation of anonymous pages and shared memory pages,
+	    Interleave mode indexes the set of nodes specified by the policy
+	    using the page offset of the faulting address into the segment
+	    [VMA] containing the address modulo the number of nodes specified
+	    by the policy.  It then attempts to allocate a page, starting at
+	    the selected node, as if the node had been specified by a Preferred
+	    policy or had been selected by a local allocation.  That is,
+	    allocation will follow the per node zonelist.
+
+	    For allocation of page cache pages, Interleave mode indexes the set
+	    of nodes specified by the policy using a node counter maintained
+	    per task.  This counter wraps around to the lowest specified node
+	    after it reaches the highest specified node.  This will tend to
+	    spread the pages out over the nodes specified by the policy based
+	    on the order in which they are allocated, rather than based on any
+	    page offset into an address range or file.  During system boot up,
+	    the temporary interleaved system default policy works in this
+	    mode.
+
+MEMORY POLICIES AND CPUSETS
+
+Memory policies work within cpusets as described above.  For memory policies
+that require a node or set of nodes, the nodes are restricted to the set of
+nodes whose memories are allowed by the cpuset constraints.  This can be
+problematic for 2 reasons:
+
+1) the memory policy APIs take physical node id's as arguments.  However, the
+   memory policy APIs do not provide a way to determine what nodes are valid
+   in the context where the application is running.  An application MAY consult
+   the cpuset file system [directly or via an out of tree, and not generally
+   available, libcpuset API] to obtain this information, but then the
+   application must be aware that it is running in a cpuset and use what are
+   intended primarily as administrative APIs.
+
+2) when tasks in two cpusets share access to a memory region, such as shared
+   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
+   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
+   may be used in the policies.  Again, obtaining this information requires
+   "stepping outside" the memory policy APIs to use the cpuset information.
+   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
+   allocation is the only valid policy.
+
+MEMORY POLICY APIs
+
+Linux supports 3 system calls for controlling memory policy.  These APIS
+always affect only the calling task, the calling task's address space, or
+some shared object mapped into the calling task's address space.
+
+	Note:  the headers that define these APIs and the parameter data types
+	for user space applications reside in a package that is not part of
+	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
+	prefix, are defined in <linux/syscalls.h>; the mode and flag
+	definitions are defined in <linux/mempolicy.h>.
+
+Set [Task] Memory Policy:
+
+	long set_mempolicy(int mode, const unsigned long *nmask,
+					unsigned long maxnode);
+
+	Set's the calling task's "task/process memory policy" to mode
+	specified by the 'mode' argument and the set of nodes defined
+	by 'nmask'.  'nmask' points to a bit mask of node ids containing
+	at least 'maxnode' ids.
+
+	See the set_mempolicy(2) man page for more details
+
+
+Get [Task] Memory Policy or Related Information
+
+	long get_mempolicy(int *mode,
+			   const unsigned long *nmask, unsigned long maxnode,
+			   void *addr, int flags);
+
+	Queries the "task/process memory policy" of the calling task, or
+	the policy or location of a specified virtual address, depending
+	on the 'flags' argument.
+
+	See the get_mempolicy(2) man page for more details
+
+
+Install VMA/Shared Policy for a Range of Task's Address Space
+
+	long mbind(void *start, unsigned long len, int mode,
+		   const unsigned long *nmask, unsigned long maxnode,
+		   unsigned flags);
+
+	mbind() installs the policy specified by (mode, nmask, maxnodes) as
+	a VMA policy for the range of the calling task's address space
+	specified by the 'start' and 'len' arguments.  Additional actions
+	may be requested via the 'flags' argument.
+
+	See the mbind(2) man page for more details.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-27 17:46                   ` Mel Gorman
@ 2007-07-27 18:38                     ` Christoph Lameter
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2007-07-27 18:38 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Fri, 27 Jul 2007, Mel Gorman wrote:

> Initial tests imply yes but I haven't done broader tests yet. It saves 64
> bytes on the size of the node structure on a non-numa i386 machine so even
> that might be noticable in some cases.

I think you can minimize the impact further by encoding information you
are looking for in the zone pointer. We are scanning for zones and for
node numbers. The zones require up to 2 bits and the nodes up to 10 bits.
So if we page align the zones structure then we have enough bits to encode
the information we are looking for in the pointers. Thus saving us
dereferencing it to check.

This may even be a performance increase vs the current situation.

> > I think this should_filter() creates more overhead than which it saves.
> 
> It's why part of the patch adds a zone_idx field to struct zone instead
> of mucking around with pgdat->node_zones.

See above. Avoid cacheline fetch by using the low bits of the zone pointer 
for zone_idx.

> > Isnt there some way to fold these traversals into a common page allocator 
> > function?
> 
> Probably. When I looked first, each of the users were traversing the zonelist
> slightly differently so it wasn't obvious how to have a single iterator but
> it's a point for improvement.

I wrote most of those and I'd be glad if you could consolidate the code 
somehow.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-27 18:00                   ` [PATCH] Document Linux Memory Policy - V2 Lee Schermerhorn
@ 2007-07-27 18:38                     ` Randy Dunlap
  2007-07-27 19:01                       ` Lee Schermerhorn
  2007-07-27 18:55                     ` Christoph Lameter
  2007-07-31 15:14                     ` Mel Gorman
  2 siblings, 1 reply; 60+ messages in thread
From: Randy Dunlap @ 2007-07-27 18:38 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Christoph Lameter, ak, Mel Gorman, KAMEZAWA Hiroyuki,
	akpm, pj, Michael Kerrisk, Eric Whitney

On Fri, 27 Jul 2007 14:00:59 -0400 Lee Schermerhorn wrote:

> [PATCH] Document Linux Memory Policy - V2
> 
>  Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 278 insertions(+)
> 
> Index: Linux/Documentation/vm/memory_policy.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
> @@ -0,0 +1,278 @@
> +

...

> +
> +MEMORY POLICY CONCEPTS
> +
> +Scope of Memory Policies
> +
> +The Linux kernel supports four more or less distinct scopes of memory policy:
> +
> +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> +    is the policy that governs the all page allocations that aren't controlled

                              drop ^ "the"

> +    by one of the more specific policy scopes discussed below.

Are these policies listed in order of "less specific scope to more
specific scope"?

> +    Task/Process Policy:  this is an optional, per-task policy.  When defined
> +    for a specific task, this policy controls all page allocations made by or
> +    on behalf of the task that aren't controlled by a more specific scope.
> +    If a task does not define a task policy, then all page allocations that
> +    would have been controlled by the task policy "fall back" to the System
> +    Default Policy.
> +

...

> +
> +    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
> +    virtual adddress space.  A task may define a specific policy for a range
> +    of its virtual address space.  This VMA policy will govern the allocation
> +    of pages that back this region of the address space.  Any regions of the
> +    task's address space that don't have an explicit VMA policy will fall back
> +    to the task policy, which may itself fall back to the system default policy.
> +
> +	VMA policy applies ONLY to anonymous pages.  These include pages
> +	allocated for anonymous segments, such as the task stack and heap, and
> +	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
> +	Anonymous pages copied from private file mappings [files mmap()ed with
> +	the MAP_PRIVATE flag] also obey VMA policy, if defined.
> +
> +	VMA policies are shared between all tasks that share a virtual address
> +	space--a.k.a. threads--independent of when the policy is installed; and
> +	they are inherited across fork().  However, because VMA policies refer
> +	to a specific region of a task's address space, and because the address
> +	space is discarded and recreated on exec*(), VMA policies are NOT
> +	inheritable across exec().  Thus, only NUMA-aware applications may
> +	use VMA policies.
> +
> +	A task may install a new VMA policy on a sub-range of a previously
> +	mmap()ed region.  When this happens, Linux splits the existing virtual
> +	memory area into 2 or 3 VMAs, each with it's own policy.

                                                its
> +
> +	By default, VMA policy applies only to pages allocated after the policy
> +	is installed.  Any pages already faulted into the VMA range remain where
> +	they were allocated based on the policy at the time they were
> +	allocated.  However, since 2.6.16, Linux supports page migration so
> +	that page contents can be moved to match a newly installed policy.
> +
> +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> +    one or more tasks' distinct address spaces.  Shared policies are applied
> +    directly to the shared object.  Thus, all tasks that attach to the object
> +    share the policy, and all pages allocated for the shared object, by any
> +    task, will obey the shared policy.
> +
> +	Currently [2.6.22], only shared memory segments, created by shmget(),
> +	support shared policy.  When shared policy support was added to Linux,
> +	the associated data structures were added to shared hugetlbfs segments.
> +	However, at the time, hugetlbfs did not support allocation at fault
> +	time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked

              a.k.a.

> +	up" to the shared policy support.  Although hugetlbfs segments now
> +	support lazy allocation, their support for shared policy has not been
> +	completed.
> +
> +	Although internal to the kernel shared memory segments are really
> +	files backed by swap space that have been mmap()ed shared into tasks'
> +	address spaces, regular files mmap()ed shared do NOT support shared

confusing sentence, esp. the beginning of it.

> +	policy.  Rather, shared page cache pages, including pages backing
> +	private mappings that have not yet been written by the task, follow
> +	task policy, if any, else system default policy.
> +
> +	The shared policy infrastructure supports different policies on subset
> +	ranges of the shared object.  However, Linux still splits the VMA of
> +	the task that installs the policy for each range of distinct policy.
> +	Thus, different tasks that attach to a shared memory segment can have
> +	different VMA configurations mapping that one shared object.
> +
> +Components of Memory Policies
> +
> +    A Linux memory policy is a tuple consisting of a "mode" and an optional set
> +    of nodes.  The mode determine the behavior of the policy, while the optional

                           determines

> +    set of nodes can be viewed as the arguments to the behavior.
> +
> +   Internally, memory policies are implemented by a reference counted structure,
> +   struct mempolicy.  Details of this structure will be discussed in context,
> +   below.
> +
> +	Note:  in some functions AND in the struct mempolicy, the mode is
> +	called "policy".  However, to avoid confusion with the policy tuple,
> +	this document will continue to use the term "mode".
> +
> +   Linux memory policy supports the following 4 modes:
> +
> +	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
> +	context dependent.
> +
> +	    During normal system operation, the system default policy is hard
> +	    coded to contain the Default mode.  During system boot up, the
> +	    system default policy is temporarily set to MPOL_INTERLEAVE [see
> +	    below] to distribute boot time allocations across all nodes in
> +	    the system, instead of using just the node containing the boot cpu.
> +
> +	    In this context, default mode means "local" allocation--that is
> +	    attempt to allocate the page from the node associated with the cpu
> +	    where the fault occurs.  If the "local" node has no memory, or the
> +	    node's memory can be exhausted [no free pages available], local
> +	    allocation will attempt to allocate pages from "nearby" nodes, using
> +	    a per node list of nodes--called zonelists--built at boot time, or
> +	    when nodes or memory are added or removed from the system [memory
> +	    hotplug].
> +
> +	    When a task/process policy or a shared policy contains the Default
> +	    mode, this also means local allocation, as described above.
> +
> +	    In the context of a VMA, Default mode means "fall back to task
> +	    policy"--which may or may not specify Default mode.  Thus, Default
> +	    mode can not be counted on to mean local allocation when used

                 cannot

> +	    on a non-shared region of the address space.  However, see
> +	    MPOL_PREFERRED below.
> +
> +	    The Default mode does not use the optional set of nodes.
> +
> +	MPOL_BIND:  This mode specifies that memory must come from the
> +	set of nodes specified by the policy.  The kernel builds a custom
> +	zonelist pointed to by the zonelist member of struct mempolicy,
> +	containing just the nodes specified by the Bind policy.  If the kernel
> +	is unable to allocate a page from the first node in the custom zonelist,
> +	it moves on to the next, and so forth.  If it is unable to allocate a
> +	page from any of the nodes in this list, the allocation will fail.
> +
> +	    The memory policy APIs do not specify an order in which the nodes
> +	    will be searched.  However, unlike the per node zonelists mentioned
> +	    above, the custom zonelist for the Bind policy do not consider the

                                                           does not

> +	    distance between the nodes.  Rather, the lists are built in order
> +	    of numeric node id.
> +
> +	MPOL_PREFERRED:  This mode specifies that the allocation should be
> +	attempted from the single node specified in the policy.  If that
> +	allocation fails, the kernel will search other nodes, exactly as
> +	it would for a local allocation that started at the preferred node--
> +	that is, using the per-node zonelists in increasing distance from
> +	the preferred node.
> +
> +	    Internally, the Preferred policy uses a single node--the
> +	    preferred_node member of struct mempolicy.
> +
> +	    If the Preferred policy node is '-1', then at page allocation time,
> +	    the kernel will use the "local node" as the starting point for the
> +	    allocation.  This is the way to specify local allocation for a
> +	    specific range of addresses--i.e. for VMA policies.
> +
> +	MPOL_INTERLEAVED:  This mode specifies that page allocations be
> +	interleaved, on a page granularity, across the nodes specified in
> +	the policy.  This mode also behaves slightly differently, based on
> +	the context where it is used:
...
> +
> +MEMORY POLICIES AND CPUSETS
> +
> +Memory policies work within cpusets as described above.  For memory policies
> +that require a node or set of nodes, the nodes are restricted to the set of
> +nodes whose memories are allowed by the cpuset constraints.  This can be
> +problematic for 2 reasons:
> +
> +1) the memory policy APIs take physical node id's as arguments.  However, the
> +   memory policy APIs do not provide a way to determine what nodes are valid
> +   in the context where the application is running.  An application MAY consult
> +   the cpuset file system [directly or via an out of tree, and not generally
> +   available, libcpuset API] to obtain this information, but then the
> +   application must be aware that it is running in a cpuset and use what are
> +   intended primarily as administrative APIs.
> +
> +2) when tasks in two cpusets share access to a memory region, such as shared
> +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and

                                          or (?)

> +   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
> +   may be used in the policies.  Again, obtaining this information requires
> +   "stepping outside" the memory policy APIs to use the cpuset information.
> +   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
> +   allocation is the only valid policy.
> +
> +MEMORY POLICY APIs
> +
> +Linux supports 3 system calls for controlling memory policy.  These APIS
> +always affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space.
> +
> +	Note:  the headers that define these APIs and the parameter data types
> +	for user space applications reside in a package that is not part of
> +	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
> +	prefix, are defined in <linux/syscalls.h>; the mode and flag
> +	definitions are defined in <linux/mempolicy.h>.
> +
> +Set [Task] Memory Policy:
> +
> +	long set_mempolicy(int mode, const unsigned long *nmask,
> +					unsigned long maxnode);
> +
> +	Set's the calling task's "task/process memory policy" to mode
> +	specified by the 'mode' argument and the set of nodes defined
> +	by 'nmask'.  'nmask' points to a bit mask of node ids containing
> +	at least 'maxnode' ids.
> +
> +	See the set_mempolicy(2) man page for more details

                                                           .

> +
> +Get [Task] Memory Policy or Related Information
> +
> +	long get_mempolicy(int *mode,
> +			   const unsigned long *nmask, unsigned long maxnode,
> +			   void *addr, int flags);
> +
> +	Queries the "task/process memory policy" of the calling task, or
> +	the policy or location of a specified virtual address, depending
> +	on the 'flags' argument.
> +
> +	See the get_mempolicy(2) man page for more details

                                                          .

> +
> +Install VMA/Shared Policy for a Range of Task's Address Space
> +
> +	long mbind(void *start, unsigned long len, int mode,
> +		   const unsigned long *nmask, unsigned long maxnode,
> +		   unsigned flags);
> +
> +	mbind() installs the policy specified by (mode, nmask, maxnodes) as
> +	a VMA policy for the range of the calling task's address space
> +	specified by the 'start' and 'len' arguments.  Additional actions
> +	may be requested via the 'flags' argument.
> +
> +	See the mbind(2) man page for more details.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-27 18:00                   ` [PATCH] Document Linux Memory Policy - V2 Lee Schermerhorn
  2007-07-27 18:38                     ` Randy Dunlap
@ 2007-07-27 18:55                     ` Christoph Lameter
  2007-07-27 19:24                       ` Lee Schermerhorn
  2007-07-31 15:14                     ` Mel Gorman
  2 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-27 18:55 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, ak, Mel Gorman, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On Fri, 27 Jul 2007, Lee Schermerhorn wrote:

> +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> +    one or more tasks' distinct address spaces.  Shared policies are applied
> +    directly to the shared object.  Thus, all tasks that attach to the object
> +    share the policy, and all pages allocated for the shared object, by any
> +    task, will obey the shared policy.

This applies to shmem only not to shared memory. Shared memory can also 
come about by mmapping a file etc. Its better to describe shmem 
as an exceptional situation later and warn of the surprises coming with 
the use of memory policies on shmem in a separate section.

> +	MPOL_BIND:  This mode specifies that memory must come from the
> +	set of nodes specified by the policy.  The kernel builds a custom
> +	zonelist pointed to by the zonelist member of struct mempolicy,
> +	containing just the nodes specified by the Bind policy.  If the kernel
> +	is unable to allocate a page from the first node in the custom zonelist,
> +	it moves on to the next, and so forth.  If it is unable to allocate a
> +	page from any of the nodes in this list, the allocation will fail.

The implementation details may not be useful to explain here and may 
change soon. Maybe just describe the effect?

> +	    The memory policy APIs do not specify an order in which the nodes
> +	    will be searched.  However, unlike the per node zonelists mentioned
> +	    above, the custom zonelist for the Bind policy do not consider the
> +	    distance between the nodes.  Rather, the lists are built in order
> +	    of numeric node id.

Yea another reson to get the nodemask as a parameter for alloc_pages().

> +2) when tasks in two cpusets share access to a memory region, such as shared
> +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
> +   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
> +   may be used in the policies.  Again, obtaining this information requires
> +   "stepping outside" the memory policy APIs to use the cpuset information.
> +   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
> +   allocation is the only valid policy.

In general this works fine with a shared mapping via mmap (which is much 
more common). The problem exists if one uses shmem with the strange shared 
semantics.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-27 18:38                     ` Randy Dunlap
@ 2007-07-27 19:01                       ` Lee Schermerhorn
  2007-07-27 19:21                         ` Randy Dunlap
  0 siblings, 1 reply; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:01 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-mm, Christoph Lameter, ak, Mel Gorman, KAMEZAWA Hiroyuki,
	akpm, pj, Michael Kerrisk, Eric Whitney

On Fri, 2007-07-27 at 11:38 -0700, Randy Dunlap wrote:
> On Fri, 27 Jul 2007 14:00:59 -0400 Lee Schermerhorn wrote:
> 
> > [PATCH] Document Linux Memory Policy - V2
> > 
> >  Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 278 insertions(+)
> > 
> > Index: Linux/Documentation/vm/memory_policy.txt
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
> > @@ -0,0 +1,278 @@
> > +
> 
> ...
> 
> > +
> > +MEMORY POLICY CONCEPTS
> > +
> > +Scope of Memory Policies
> > +
> > +The Linux kernel supports four more or less distinct scopes of memory policy:
> > +
> > +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> > +    is the policy that governs the all page allocations that aren't controlled
> 
>                               drop ^ "the"
> 
> > +    by one of the more specific policy scopes discussed below.
> 
> Are these policies listed in order of "less specific scope to more
> specific scope"?

Randy:

Thanks for the quick review.   I will make the edits you suggest and
re-post after the weekend [hoping for more feedback...].

To answer your question, yes, it was my intent to order them from least
specific [or most general?] to most specific.  Shall I say so?

Other than these items, does the document make sense?  Do you think it's
worth adding?  Andi was concerned about having documentation in too many
places [code + doc].

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-27 19:01                       ` Lee Schermerhorn
@ 2007-07-27 19:21                         ` Randy Dunlap
  0 siblings, 0 replies; 60+ messages in thread
From: Randy Dunlap @ 2007-07-27 19:21 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Christoph Lameter, ak, Mel Gorman, KAMEZAWA Hiroyuki,
	akpm, pj, Michael Kerrisk, Eric Whitney

On Fri, 27 Jul 2007 15:01:28 -0400 Lee Schermerhorn wrote:

> On Fri, 2007-07-27 at 11:38 -0700, Randy Dunlap wrote:
> > On Fri, 27 Jul 2007 14:00:59 -0400 Lee Schermerhorn wrote:
> > 
> > > [PATCH] Document Linux Memory Policy - V2
> > > 
> > >  Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 278 insertions(+)
> > > 
> > > Index: Linux/Documentation/vm/memory_policy.txt
> > > ===================================================================
> > > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > > +++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
> > > @@ -0,0 +1,278 @@
> > > +
> > 
> > ...
> > 
> > > +
> > > +MEMORY POLICY CONCEPTS
> > > +
> > > +Scope of Memory Policies
> > > +
> > > +The Linux kernel supports four more or less distinct scopes of memory policy:
> > > +
> > > +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> > > +    is the policy that governs the all page allocations that aren't controlled
> > 
> >                               drop ^ "the"
> > 
> > > +    by one of the more specific policy scopes discussed below.
> > 
> > Are these policies listed in order of "less specific scope to more
> > specific scope"?
> 
> Randy:
> 
> Thanks for the quick review.   I will make the edits you suggest and
> re-post after the weekend [hoping for more feedback...].

Sure.

> To answer your question, yes, it was my intent to order them from least
> specific [or most general?] to most specific.  Shall I say so?

Yes.  I would.

> Other than these items, does the document make sense?  Do you think it's
> worth adding?  Andi was concerned about having documentation in too many
> places [code + doc].

Yes, I think that it's worth adding and makes sense, although
Christoph's comment about documenting effects instead of internal
workings also makes sense to me.  That would also tend to mitigate
Andi's concern a bit.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-27 18:55                     ` Christoph Lameter
@ 2007-07-27 19:24                       ` Lee Schermerhorn
  0 siblings, 0 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, ak, Mel Gorman, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On Fri, 2007-07-27 at 11:55 -0700, Christoph Lameter wrote:
> On Fri, 27 Jul 2007, Lee Schermerhorn wrote:
> 
> > +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> > +    one or more tasks' distinct address spaces.  Shared policies are applied
> > +    directly to the shared object.  Thus, all tasks that attach to the object
> > +    share the policy, and all pages allocated for the shared object, by any
> > +    task, will obey the shared policy.
> 
> This applies to shmem only not to shared memory. Shared memory can also 
> come about by mmapping a file etc. Its better to describe shmem 
> as an exceptional situation later and warn of the surprises coming with 
> the use of memory policies on shmem in a separate section.

I do explain that later in the doc.  I'll see if I can reword it to pull
that up here.

> 
> > +	MPOL_BIND:  This mode specifies that memory must come from the
> > +	set of nodes specified by the policy.  The kernel builds a custom
> > +	zonelist pointed to by the zonelist member of struct mempolicy,
> > +	containing just the nodes specified by the Bind policy.  If the kernel
> > +	is unable to allocate a page from the first node in the custom zonelist,
> > +	it moves on to the next, and so forth.  If it is unable to allocate a
> > +	page from any of the nodes in this list, the allocation will fail.
> 
> The implementation details may not be useful to explain here and may 
> change soon. Maybe just describe the effect?

I wanted to explain it to contrast to node zonelists and as context for
the next paragraph.  I think the notion of custom zonelists is important
in the current implementation.  And, I plan to keep this up to date with
the forth coming changes.  Maybe it'll change before this even get's
merged into Linus' tree.  But, if I could get this into -mm, I can
submit update patches making it clear what changed when.

> 
> > +	    The memory policy APIs do not specify an order in which the nodes
> > +	    will be searched.  However, unlike the per node zonelists mentioned
> > +	    above, the custom zonelist for the Bind policy do not consider the
> > +	    distance between the nodes.  Rather, the lists are built in order
> > +	    of numeric node id.
> 
> Yea another reson to get the nodemask as a parameter for alloc_pages().

OK.  Again, just wanted to make current behavior explicit.  Will update
when it changes.

> 
> > +2) when tasks in two cpusets share access to a memory region, such as shared
> > +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
> > +   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
> > +   may be used in the policies.  Again, obtaining this information requires
> > +   "stepping outside" the memory policy APIs to use the cpuset information.
> > +   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
> > +   allocation is the only valid policy.
> 
> In general this works fine with a shared mapping via mmap (which is much 
> more common). The problem exists if one uses shmem with the strange shared 
> semantics.

If the shared mapping is with MAP_ANONYMOUS, I believe that you get
"shmem"--same issues as with "shm" [SysV shared memory].  It works
"fine" [your definition, I guess] for shared, mmap()ed files because the
policy doesn't get applied to the object and the vma policy is ignored.
As far as the shared policy semantics being "strange", let's not restart
that, uh, "discussion" in this thread.  I've tried to avoid that topic
in this document, and just describe the concepts/design/behavior in the
interest of getting a baseline document.  That said, undoubtedly my bias
sneaks through in places.

As I mentioned to Randy, I'll make another pass after weekend.

Have a good one,
Lee  



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-27 15:45               ` Mel Gorman
  2007-07-27 17:35                 ` Christoph Lameter
@ 2007-07-28  7:28                 ` KAMEZAWA Hiroyuki
  2007-07-28 11:57                   ` Mel Gorman
  1 sibling, 1 reply; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-28  7:28 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, linux-mm, Lee Schermerhorn, ak, akpm, pj

On Fri, 27 Jul 2007 16:45:19 +0100
mel@skynet.ie (Mel Gorman) wrote:

> Obvious things that are outstanding;
> 
> o Compile-test parisc
> o Split patch in two to keep the zone_idx changes separetly
> o Verify zlccache is not broken
> o Have a version of __alloc_pages take a nodemask and ditch
>   bind_zonelist()
> 
> I can work on bringing this up to scratch during the cycle.
> 
> Patch as follows. Comments?
> 

I like this idea in general. My concern is zonelist scan cost.
 Hmm, can this be help ?

---
 include/linux/mmzone.h |    1 
 mm/page_alloc.c        |   51 +++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc1.test/include/linux/mmzone.h
===================================================================
--- linux-2.6.23-rc1.test.orig/include/linux/mmzone.h
+++ linux-2.6.23-rc1.test/include/linux/mmzone.h
@@ -406,6 +406,7 @@ struct zonelist_cache;
 
 struct zonelist {
 	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
+	unsigned short gfp_skip[MAX_NR_ZONES];
 	struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
 #ifdef CONFIG_NUMA
 	struct zonelist_cache zlcache;			     // optional ...
Index: linux-2.6.23-rc1.test/mm/page_alloc.c
===================================================================
--- linux-2.6.23-rc1.test.orig/mm/page_alloc.c
+++ linux-2.6.23-rc1.test/mm/page_alloc.c
@@ -1158,13 +1158,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	enum zone_type highest_zoneidx = gfp_zone(gfp_mask);
+	int default_skip = zonelist->gfp_skip[highest_zoneidx];
 
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	z = zonelist->zones;
+	z = zonelist->zones + default_skip;
 
 	do {
 		if (should_filter_zone(*z, highest_zoneidx))
@@ -1235,6 +1236,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 	int do_retry;
 	int alloc_flags;
 	int did_some_progress;
+	int gfp_skip = zonelist->gfp_skip[gfp_zone(gfp_mask)];
 
 	might_sleep_if(wait);
 
@@ -1265,7 +1267,7 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for (z = zonelist->zones; *z; z++)
+	for (z = zonelist->zones + gfp_skip; *z; z++)
 		wakeup_kswapd(*z, order);
 
 	/*
@@ -2050,6 +2052,50 @@ static void build_zonelist_cache(pg_data
 
 #endif	/* CONFIG_NUMA */
 
+static inline 
+unsigned short find_first_zone(enum zone_type target, struct zonelist *zl)
+{
+	unsigned short index = 0;
+	struct zone *z;
+	z = zl->zones[index];
+	while (z != NULL) {
+		if (!should_filter_zone(z, target))
+			return index;
+		z = zl->zones[++index];
+	}
+	return 0;
+}
+/*
+ * record the first available zone per gfp.
+ */
+
+static void build_zonelist_skip(pg_data_t *pgdat)
+{
+	enum zone_type target;
+	unsigned short index;
+	struct zonelist *zl = &pgdat->node_zonelist;
+
+	target = gfp_zone(GFP_KERNEL|GFP_DMA);
+	index = find_first_zone(target, zl);
+	zl->gfp_skip[target] = index;
+
+	target = gfp_zone(GFP_KERNEL|GFP_DMA32);
+	index = find_first_zone(target, zl);
+	zl->gfp_skip[target] = index;
+
+	target = gfp_zone(GFP_KERNEL);
+	index = find_first_zone(target, zl);
+	zl->gfp_skip[target] = index;
+
+	target = gfp_zone(GFP_HIGHUSER);
+	index = find_first_zone(target, zl);
+	zl->gfp_skip[target] = index;
+
+	target = gfp_zone(GFP_HIGHUSER_MOVABLE);
+	index = find_first_zone(target, zl);
+	zl->gfp_skip[target] = index;
+}
+
 /* return values int ....just for stop_machine_run() */
 static int __build_all_zonelists(void *dummy)
 {
@@ -2058,6 +2104,7 @@ static int __build_all_zonelists(void *d
 	for_each_online_node(nid) {
 		build_zonelists(NODE_DATA(nid));
 		build_zonelist_cache(NODE_DATA(nid));
+		build_zonelist_skip(NODE_DATA(nid));
 	}
 	return 0;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-28  7:28                 ` NUMA policy issues with ZONE_MOVABLE KAMEZAWA Hiroyuki
@ 2007-07-28 11:57                   ` Mel Gorman
  2007-07-28 14:10                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-07-28 11:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, Linux Memory Management List,
	Lee Schermerhorn, ak, akpm, pj

On Sat, 28 Jul 2007, KAMEZAWA Hiroyuki wrote:

> On Fri, 27 Jul 2007 16:45:19 +0100
> mel@skynet.ie (Mel Gorman) wrote:
>
>> Obvious things that are outstanding;
>>
>> o Compile-test parisc
>> o Split patch in two to keep the zone_idx changes separetly
>> o Verify zlccache is not broken
>> o Have a version of __alloc_pages take a nodemask and ditch
>>   bind_zonelist()
>>
>> I can work on bringing this up to scratch during the cycle.
>>
>> Patch as follows. Comments?
>>
>
> I like this idea in general. My concern is zonelist scan cost.
> Hmm, can this be help ?
>

Does this not make the assumption that the zonelists are in zone-order as 
opposed to node? i.e. that is is

H1N1D1H2N2D2H3N3D3 instead of
H1H2H3N1N2N3D1D2D3

If it's node-order, does this scheme break?

> ---
> include/linux/mmzone.h |    1
> mm/page_alloc.c        |   51 +++++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 50 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.23-rc1.test/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.23-rc1.test.orig/include/linux/mmzone.h
> +++ linux-2.6.23-rc1.test/include/linux/mmzone.h
> @@ -406,6 +406,7 @@ struct zonelist_cache;
>
> struct zonelist {
> 	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
> +	unsigned short gfp_skip[MAX_NR_ZONES];
> 	struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
> #ifdef CONFIG_NUMA
> 	struct zonelist_cache zlcache;			     // optional ...
> Index: linux-2.6.23-rc1.test/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.23-rc1.test.orig/mm/page_alloc.c
> +++ linux-2.6.23-rc1.test/mm/page_alloc.c
> @@ -1158,13 +1158,14 @@ get_page_from_freelist(gfp_t gfp_mask, u
> 	int zlc_active = 0;		/* set if using zonelist_cache */
> 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> 	enum zone_type highest_zoneidx = gfp_zone(gfp_mask);
> +	int default_skip = zonelist->gfp_skip[highest_zoneidx];
>
> zonelist_scan:
> 	/*
> 	 * Scan zonelist, looking for a zone with enough free.
> 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> 	 */
> -	z = zonelist->zones;
> +	z = zonelist->zones + default_skip;
>
> 	do {
> 		if (should_filter_zone(*z, highest_zoneidx))
> @@ -1235,6 +1236,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
> 	int do_retry;
> 	int alloc_flags;
> 	int did_some_progress;
> +	int gfp_skip = zonelist->gfp_skip[gfp_zone(gfp_mask)];
>
> 	might_sleep_if(wait);
>
> @@ -1265,7 +1267,7 @@ restart:
> 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> 		goto nopage;
>
> -	for (z = zonelist->zones; *z; z++)
> +	for (z = zonelist->zones + gfp_skip; *z; z++)
> 		wakeup_kswapd(*z, order);
>
> 	/*
> @@ -2050,6 +2052,50 @@ static void build_zonelist_cache(pg_data
>
> #endif	/* CONFIG_NUMA */
>
> +static inline
> +unsigned short find_first_zone(enum zone_type target, struct zonelist *zl)
> +{
> +	unsigned short index = 0;
> +	struct zone *z;
> +	z = zl->zones[index];
> +	while (z != NULL) {
> +		if (!should_filter_zone(z, target))
> +			return index;
> +		z = zl->zones[++index];
> +	}
> +	return 0;
> +}
> +/*
> + * record the first available zone per gfp.
> + */
> +
> +static void build_zonelist_skip(pg_data_t *pgdat)
> +{
> +	enum zone_type target;
> +	unsigned short index;
> +	struct zonelist *zl = &pgdat->node_zonelist;
> +
> +	target = gfp_zone(GFP_KERNEL|GFP_DMA);
> +	index = find_first_zone(target, zl);
> +	zl->gfp_skip[target] = index;
> +
> +	target = gfp_zone(GFP_KERNEL|GFP_DMA32);
> +	index = find_first_zone(target, zl);
> +	zl->gfp_skip[target] = index;
> +
> +	target = gfp_zone(GFP_KERNEL);
> +	index = find_first_zone(target, zl);
> +	zl->gfp_skip[target] = index;
> +
> +	target = gfp_zone(GFP_HIGHUSER);
> +	index = find_first_zone(target, zl);
> +	zl->gfp_skip[target] = index;
> +
> +	target = gfp_zone(GFP_HIGHUSER_MOVABLE);
> +	index = find_first_zone(target, zl);
> +	zl->gfp_skip[target] = index;
> +}
> +
> /* return values int ....just for stop_machine_run() */
> static int __build_all_zonelists(void *dummy)
> {
> @@ -2058,6 +2104,7 @@ static int __build_all_zonelists(void *d
> 	for_each_online_node(nid) {
> 		build_zonelists(NODE_DATA(nid));
> 		build_zonelist_cache(NODE_DATA(nid));
> +		build_zonelist_skip(NODE_DATA(nid));
> 	}
> 	return 0;
> }
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-28 11:57                   ` Mel Gorman
@ 2007-07-28 14:10                     ` KAMEZAWA Hiroyuki
  2007-07-28 14:21                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-28 14:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: clameter, linux-mm, Lee.Schermerhorn, ak, akpm, pj

On Sat, 28 Jul 2007 12:57:09 +0100 (IST)
Mel Gorman <mel@csn.ul.ie> wrote:

> > I like this idea in general. My concern is zonelist scan cost.
> > Hmm, can this be help ?
> >
> 
> Does this not make the assumption that the zonelists are in zone-order as 
> opposed to node? i.e. that is is
> 
> H1N1D1H2N2D2H3N3D3 instead of
> H1H2H3N1N2N3D1D2D3
> 
> If it's node-order, does this scheme break?
> 

Maybe no. "skip" will point to the nearest available zone anyway.
But there may be better scheme. This is jus an easy idea.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-28 14:10                     ` KAMEZAWA Hiroyuki
@ 2007-07-28 14:21                       ` KAMEZAWA Hiroyuki
  2007-07-30 12:41                         ` Mel Gorman
  0 siblings, 1 reply; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-28 14:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: mel, clameter, linux-mm, Lee.Schermerhorn, ak, akpm, pj

On Sat, 28 Jul 2007 23:10:32 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > H1N1D1H2N2D2H3N3D3 instead of
> > H1H2H3N1N2N3D1D2D3
> > 
> > If it's node-order, does this scheme break?
> > 
> 
> Maybe no. "skip" will point to the nearest available zone anyway.
> But there may be better scheme. This is jus an easy idea.
> 
Assume zonelist on Node0, 

zone order:  M0M1M2M3N0N1N2N3D0 (only node 0 has zone dma)
node order:  M0N0D0M1N1M2N2N3  

GFP_KERNEL for zone_order: skip 4, find N0N2N3D0
GFP_KERNEL for node_order: skip 1, find N0D0N2N3

I'm not sure that this easy trick can show performance benefit.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-28 14:21                       ` KAMEZAWA Hiroyuki
@ 2007-07-30 12:41                         ` Mel Gorman
  2007-07-30 18:06                           ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-07-30 12:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: clameter, linux-mm, Lee.Schermerhorn, ak, akpm, pj

On Sat, 28 Jul 2007, KAMEZAWA Hiroyuki wrote:

> On Sat, 28 Jul 2007 23:10:32 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> H1N1D1H2N2D2H3N3D3 instead of
>>> H1H2H3N1N2N3D1D2D3
>>>
>>> If it's node-order, does this scheme break?
>>>
>>
>> Maybe no. "skip" will point to the nearest available zone anyway.
>> But there may be better scheme. This is jus an easy idea.
>>
> Assume zonelist on Node0,
>
> zone order:  M0M1M2M3N0N1N2N3D0 (only node 0 has zone dma)
> node order:  M0N0D0M1N1M2N2N3
>
> GFP_KERNEL for zone_order: skip 4, find N0N2N3D0
> GFP_KERNEL for node_order: skip 1, find N0D0N2N3
>
> I'm not sure that this easy trick can show performance benefit.
>

The results from kernbench were mixed. Small improves on some machines and 
small regressions on others. I'll keep the patch on the stack and 
investigate it further with other benchmarks.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-30 12:41                         ` Mel Gorman
@ 2007-07-30 18:06                           ` Christoph Lameter
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2007-07-30 18:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: KAMEZAWA Hiroyuki, linux-mm, Lee.Schermerhorn, ak, akpm, pj

On Mon, 30 Jul 2007, Mel Gorman wrote:

> The results from kernbench were mixed. Small improves on some machines and
> small regressions on others. I'll keep the patch on the stack and investigate
> it further with other benchmarks.

Optimize the scanning by encodeing the zone type in the pointer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-27 18:00                   ` [PATCH] Document Linux Memory Policy - V2 Lee Schermerhorn
  2007-07-27 18:38                     ` Randy Dunlap
  2007-07-27 18:55                     ` Christoph Lameter
@ 2007-07-31 15:14                     ` Mel Gorman
  2007-07-31 16:34                       ` Lee Schermerhorn
  2 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-07-31 15:14 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Christoph Lameter, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On (27/07/07 14:00), Lee Schermerhorn didst pronounce:
> Here's a second attempt to document the existing Linux Memory Policy.
> I've tried to address comments on the first cut from Christoph and Andi.
> I've removed the details of the APIs and referenced the man pages "for
> more details".  I've made a stab at addressing the interaction with
> cpusets, but more could be done here.
> 
> I'm hoping we can get this merged in some form, and then update it with
> all of the policy changes that are in the queue and/or being
> worked--memoryless nodes, interaction with ZONE_MOVABLE, ... .
> 
> Lee
> 
> ----------------
> 
> [PATCH] Document Linux Memory Policy - V2
> 
> I couldn't find any memory policy documentation in the Documentation
> directory, so here is my attempt to document it.
> 
> There's lots more that could be written about the internal design--including
> data structures, functions, etc.  However, if you agree that this is better
> that the nothing that exists now, perhaps it could be merged.  This will
> provide a baseline for updates to document the many policy patches that are
> currently being worked.
> 

As pointed out elsewhere, you are better off describing how the policies
appear to behave from outside. If you describe the internals to any decent
level of detail, it'll be obsolete in 6 months time.

> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 278 insertions(+)
> 
> Index: Linux/Documentation/vm/memory_policy.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
> @@ -0,0 +1,278 @@
> +
> +What is Linux Memory Policy?
> +
> +In the Linux kernel, "memory policy" determines from which node the kernel will
> +allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
> +supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
> +The current memory policy support was added to Linux 2.6 around May 2004.  This
> +document attempts to describe the concepts and APIs of the 2.6 memory policy
> +support.
> +
> +See also Documentation/cpusets.txt which describes a higher level,
> +administrative mechanism for restricting the set of nodes from which memory
> +policy may allocate pages.  Also, see "MEMORY POLICIES AND CPUSETS" below.
> +

hmm. This may conflate what cpusets and memory policies are. Try
something like;

This should not be confused with cpusets (Documentation/cpusets.txt) which
is an administrative mechanism for restricting the usable nodes memory be
allocated from by a set of processes. Memory policies are a programming
interface that a NUMA-aware application can take advantage of. When both
cpusets and policies are applied to a task, the restrictions of the cpuset
takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details.

> +MEMORY POLICY CONCEPTS
> +
> +Scope of Memory Policies
> +
> +The Linux kernel supports four more or less distinct scopes of memory policy:
> +

The sentence is too passive. State with certainity like

The Linux kernel supports four distinct scopes of memory policy:

Otherwise when I'm reading it I feel I must check if there are more or
less than four types of policy.

> +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> +    is the policy that governs the all page allocations that aren't controlled
> +    by one of the more specific policy scopes discussed below.
> +

It's not stated what this policy means until much later. Forward
references like that may be confusing so consider adding something like;

The default policy will allocate from the closest memory node to the currently
running CPU and fallback to nodes in order of distance.

> +    Task/Process Policy:  this is an optional, per-task policy.  When defined
> +    for a specific task, this policy controls all page allocations made by or
> +    on behalf of the task that aren't controlled by a more specific scope.
> +    If a task does not define a task policy, then all page allocations that
> +    would have been controlled by the task policy "fall back" to the System
> +    Default Policy.
> +

Consider reversing the order you are talking about the policies. If you
discuss the policies with more restricted scope and finish with the
default policy, you can avoid future references.

> +	Because task policy applies to the entire address space of a task,
> +	it is inheritable across both fork() [clone() w/o the CLONE_VM flag]
> +	and exec*(). 

Remove "Because" here. The policy is not inherited across fork() because
it applies to the address space. It's because policy is stored in the
task_struct and it's not cleared by fork() or exec().

The use of inheritable here implies that the process must take some
special action for the child to inherit the policy. Is that the case? If
not, say inherited instead of inheritable.

> Thus, a parent task may establish the task policy for
> +	a child task exec()'d from an executable image that has no awareness
> +	of memory policy.
> +
> +	In a multi-threaded task, task policies apply only to the thread
> +	[Linux kernel task] that installs the policy and any threads
> +	subsequently created by that thread.  Any sibling threads existing
> +	at the time a new task policy is installed retain their current
> +	policy.
> +

Is it worth mentioning numactl here?

> +	A task policy applies only to pages allocated after the policy is
> +	installed.  Any pages already faulted in by the task remain where
> +	they were allocated based on the policy at the time they were
> +	allocated.
> +
> +    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
> +    virtual adddress space.  A task may define a specific policy for a range
> +    of its virtual address space.  This VMA policy will govern the allocation
> +    of pages that back this region of the address space.  Any regions of the
> +    task's address space that don't have an explicit VMA policy will fall back
> +    to the task policy, which may itself fall back to the system default policy.
> +
> +	VMA policy applies ONLY to anonymous pages.  These include pages
> +	allocated for anonymous segments, such as the task stack and heap, and
> +	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
> +	Anonymous pages copied from private file mappings [files mmap()ed with
> +	the MAP_PRIVATE flag] also obey VMA policy, if defined.
> +

The last sentence is confusing. Does it mean that policies can be
applied to file mappings but only if they are MAP_PRIVATE and the policy
only comes into play during COW?

> +	VMA policies are shared between all tasks that share a virtual address
> +	space--a.k.a. threads--independent of when the policy is installed; and
> +	they are inherited across fork().  However, because VMA policies refer
> +	to a specific region of a task's address space, and because the address
> +	space is discarded and recreated on exec*(), VMA policies are NOT
> +	inheritable across exec().  Thus, only NUMA-aware applications may
> +	use VMA policies.
> +
> +	A task may install a new VMA policy on a sub-range of a previously
> +	mmap()ed region.  When this happens, Linux splits the existing virtual
> +	memory area into 2 or 3 VMAs, each with it's own policy.
> +
> +	By default, VMA policy applies only to pages allocated after the policy
> +	is installed.  Any pages already faulted into the VMA range remain where
> +	they were allocated based on the policy at the time they were
> +	allocated.  However, since 2.6.16, Linux supports page migration so
> +	that page contents can be moved to match a newly installed policy.
> +

State what system call is needed for the migration to take place.

> +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> +    one or more tasks' distinct address spaces.  Shared policies are applied
> +    directly to the shared object.  Thus, all tasks that attach to the object
> +    share the policy, and all pages allocated for the shared object, by any
> +    task, will obey the shared policy.
> +
> +	Currently [2.6.22], only shared memory segments, created by shmget(),
> +	support shared policy. 

This appears to contradict the previous paragram. The last paragraph
would imply that the policy is applied to mappings that are mmaped
MAP_SHARED where they really only apply to shmem mappings.

> +	When shared policy support was added to Linux,
> +	the associated data structures were added to shared hugetlbfs segments.
> +	However, at the time, hugetlbfs did not support allocation at fault
> +	time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> +	up" to the shared policy support.  Although hugetlbfs segments now
> +	support lazy allocation, their support for shared policy has not been
> +	completed.
> +
> +	Although internal to the kernel shared memory segments are really
> +	files backed by swap space that have been mmap()ed shared into tasks'
> +	address spaces, regular files mmap()ed shared do NOT support shared
> +	policy.  Rather, shared page cache pages, including pages backing
> +	private mappings that have not yet been written by the task, follow
> +	task policy, if any, else system default policy.
> +
> +	The shared policy infrastructure supports different policies on subset
> +	ranges of the shared object.  However, Linux still splits the VMA of
> +	the task that installs the policy for each range of distinct policy.
> +	Thus, different tasks that attach to a shared memory segment can have
> +	different VMA configurations mapping that one shared object.
> +
> +Components of Memory Policies
> +
> +    A Linux memory policy is a tuple consisting of a "mode" and an optional set
> +    of nodes.  The mode determine the behavior of the policy, while the optional
> +    set of nodes can be viewed as the arguments to the behavior.
> +
> +   Internally, memory policies are implemented by a reference counted structure,
> +   struct mempolicy.  Details of this structure will be discussed in context,
> +   below.
> +
> +	Note:  in some functions AND in the struct mempolicy, the mode is
> +	called "policy".  However, to avoid confusion with the policy tuple,
> +	this document will continue to use the term "mode".
> +
> +   Linux memory policy supports the following 4 modes:
> +
> +	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
> +	context dependent.
> +
> +	    During normal system operation, the system default policy is hard
> +	    coded to contain the Default mode.  During system boot up, the
> +	    system default policy is temporarily set to MPOL_INTERLEAVE [see
> +	    below] to distribute boot time allocations across all nodes in
> +	    the system, instead of using just the node containing the boot cpu.
> +
> +	    In this context, default mode means "local" allocation--that is
> +	    attempt to allocate the page from the node associated with the cpu
> +	    where the fault occurs.  If the "local" node has no memory, or the
> +	    node's memory can be exhausted [no free pages available], local
> +	    allocation will attempt to allocate pages from "nearby" nodes, using
> +	    a per node list of nodes--called zonelists--built at boot time, or
> +	    when nodes or memory are added or removed from the system [memory
> +	    hotplug].
> +
> +	    When a task/process policy or a shared policy contains the Default
> +	    mode, this also means local allocation, as described above.
> +
> +	    In the context of a VMA, Default mode means "fall back to task
> +	    policy"--which may or may not specify Default mode.  Thus, Default
> +	    mode can not be counted on to mean local allocation when used
> +	    on a non-shared region of the address space.  However, see
> +	    MPOL_PREFERRED below.
> +
> +	    The Default mode does not use the optional set of nodes.
> +
> +	MPOL_BIND:  This mode specifies that memory must come from the
> +	set of nodes specified by the policy.  The kernel builds a custom
> +	zonelist pointed to by the zonelist member of struct mempolicy,
> +	containing just the nodes specified by the Bind policy.  If the kernel

Omit the implementation details here. Even now it is being considered to
have just one zonelist per-node that is filtered based on the allocation
requirements. For MPOL_BIND, this would involve __alloc_pages() taking a
nodemask and ignoring nodes not allowed by the mask.

It's sufficent to say that MPOL_BIND will restrict the process to allocating
pages within a set of nodes specified by a nodemask because the end result
from the external observer will be similar.

> +	is unable to allocate a page from the first node in the custom zonelist,
> +	it moves on to the next, and so forth.  If it is unable to allocate a
> +	page from any of the nodes in this list, the allocation will fail.
> +
> +	    The memory policy APIs do not specify an order in which the nodes
> +	    will be searched.  However, unlike the per node zonelists mentioned
> +	    above, the custom zonelist for the Bind policy do not consider the
> +	    distance between the nodes.  Rather, the lists are built in order
> +	    of numeric node id.
> +

Omit the last part as well because if we were filtering nodes based on a
mask as described above, then MPOL_BIND would actually behave similar to
the default policy except that is uses a subset of the available nodes.
Arguably that is more sensible behaviour for MPOL_BIND than what it does today.

> +	MPOL_PREFERRED:  This mode specifies that the allocation should be
> +	attempted from the single node specified in the policy.  If that
> +	allocation fails, the kernel will search other nodes, exactly as
> +	it would for a local allocation that started at the preferred node--
> +	that is, using the per-node zonelists in increasing distance from
> +	the preferred node.
> +
> +	    Internally, the Preferred policy uses a single node--the
> +	    preferred_node member of struct mempolicy.
> +
> +	    If the Preferred policy node is '-1', then at page allocation time,
> +	    the kernel will use the "local node" as the starting point for the
> +	    allocation.  This is the way to specify local allocation for a
> +	    specific range of addresses--i.e. for VMA policies.
> +

Again, consider omitting the implementation details here. They don't
help as such.

> +	MPOL_INTERLEAVED:  This mode specifies that page allocations be
> +	interleaved, on a page granularity, across the nodes specified in
> +	the policy.  This mode also behaves slightly differently, based on
> +	the context where it is used:
> +
> +	    For allocation of anonymous pages and shared memory pages,
> +	    Interleave mode indexes the set of nodes specified by the policy
> +	    using the page offset of the faulting address into the segment
> +	    [VMA] containing the address modulo the number of nodes specified
> +	    by the policy.  It then attempts to allocate a page, starting at
> +	    the selected node, as if the node had been specified by a Preferred
> +	    policy or had been selected by a local allocation.  That is,
> +	    allocation will follow the per node zonelist.
> +
> +	    For allocation of page cache pages, Interleave mode indexes the set
> +	    of nodes specified by the policy using a node counter maintained
> +	    per task.  This counter wraps around to the lowest specified node
> +	    after it reaches the highest specified node.  This will tend to
> +	    spread the pages out over the nodes specified by the policy based
> +	    on the order in which they are allocated, rather than based on any
> +	    page offset into an address range or file.  During system boot up,
> +	    the temporary interleaved system default policy works in this
> +	    mode.
> +

Oddly, these implementation details are really useful. Keep this one here
but it would be great if they were in the manual pages.

> +MEMORY POLICIES AND CPUSETS
> +
> +Memory policies work within cpusets as described above.  For memory policies
> +that require a node or set of nodes, the nodes are restricted to the set of
> +nodes whose memories are allowed by the cpuset constraints.  This can be
> +problematic for 2 reasons:
> +
> +1) the memory policy APIs take physical node id's as arguments.  However, the
> +   memory policy APIs do not provide a way to determine what nodes are valid
> +   in the context where the application is running.  An application MAY consult
> +   the cpuset file system [directly or via an out of tree, and not generally
> +   available, libcpuset API] to obtain this information, but then the
> +   application must be aware that it is running in a cpuset and use what are
> +   intended primarily as administrative APIs.
> +
> +2) when tasks in two cpusets share access to a memory region, such as shared
> +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
> +   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
> +   may be used in the policies.  Again, obtaining this information requires
> +   "stepping outside" the memory policy APIs to use the cpuset information.
> +   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
> +   allocation is the only valid policy.
> +

Consider moving this section to the end. It reads better to keep the discussion
in the context of policies for as long as possible. Otherwise it's

Section 1: policies
Section 2: policies
Section 3: policies + cpusets
Section 4: policies

> +MEMORY POLICY APIs
> +
> +Linux supports 3 system calls for controlling memory policy.  These APIS

s/APIS/APIs/

> +always affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space.
> +
> +	Note:  the headers that define these APIs and the parameter data types
> +	for user space applications reside in a package that is not part of
> +	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
> +	prefix, are defined in <linux/syscalls.h>; the mode and flag
> +	definitions are defined in <linux/mempolicy.h>.
> +
> +Set [Task] Memory Policy:
> +
> +	long set_mempolicy(int mode, const unsigned long *nmask,
> +					unsigned long maxnode);
> +
> +	Set's the calling task's "task/process memory policy" to mode
> +	specified by the 'mode' argument and the set of nodes defined
> +	by 'nmask'.  'nmask' points to a bit mask of node ids containing
> +	at least 'maxnode' ids.
> +
> +	See the set_mempolicy(2) man page for more details
> +
> +
> +Get [Task] Memory Policy or Related Information
> +
> +	long get_mempolicy(int *mode,
> +			   const unsigned long *nmask, unsigned long maxnode,
> +			   void *addr, int flags);
> +
> +	Queries the "task/process memory policy" of the calling task, or
> +	the policy or location of a specified virtual address, depending
> +	on the 'flags' argument.
> +
> +	See the get_mempolicy(2) man page for more details
> +
> +
> +Install VMA/Shared Policy for a Range of Task's Address Space
> +
> +	long mbind(void *start, unsigned long len, int mode,
> +		   const unsigned long *nmask, unsigned long maxnode,
> +		   unsigned flags);
> +
> +	mbind() installs the policy specified by (mode, nmask, maxnodes) as
> +	a VMA policy for the range of the calling task's address space
> +	specified by the 'start' and 'len' arguments.  Additional actions
> +	may be requested via the 'flags' argument.
> +
> +	See the mbind(2) man page for more details.
> 

Despite the comments, this is good work and really useful. I'd be fairly
happy with it even without further revisions. Thanks a lot for the read.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-31 15:14                     ` Mel Gorman
@ 2007-07-31 16:34                       ` Lee Schermerhorn
  2007-07-31 19:10                         ` Christoph Lameter
  2007-07-31 20:48                         ` [PATCH] Document Linux Memory Policy - V3 Lee Schermerhorn
  0 siblings, 2 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-31 16:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christoph Lameter, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

Thanks for the review, Mel.

On Tue, 2007-07-31 at 16:14 +0100, Mel Gorman wrote:
> On (27/07/07 14:00), Lee Schermerhorn didst pronounce:
> > Here's a second attempt to document the existing Linux Memory Policy.
> > I've tried to address comments on the first cut from Christoph and Andi.
> > I've removed the details of the APIs and referenced the man pages "for
> > more details".  I've made a stab at addressing the interaction with
> > cpusets, but more could be done here.
> > 
> > I'm hoping we can get this merged in some form, and then update it with
> > all of the policy changes that are in the queue and/or being
> > worked--memoryless nodes, interaction with ZONE_MOVABLE, ... .
> > 
> > Lee
> > 
> > ----------------
> > 
> > [PATCH] Document Linux Memory Policy - V2
> > 
> > I couldn't find any memory policy documentation in the Documentation
> > directory, so here is my attempt to document it.
> > 
> > There's lots more that could be written about the internal design--including
> > data structures, functions, etc.  However, if you agree that this is better
> > that the nothing that exists now, perhaps it could be merged.  This will
> > provide a baseline for updates to document the many policy patches that are
> > currently being worked.
> > 
> 
> As pointed out elsewhere, you are better off describing how the policies
> appear to behave from outside. If you describe the internals to any decent
> level of detail, it'll be obsolete in 6 months time.

OK.  I'll try to do that, without losing what I consider important
semantics.  However, it'll only be obsolete if people post patches that
change the behavior w/o updating the doc.  That NEVER happens,
right? ;-)

I will note, tho', that the cpuset.txt doc, for example, contains a
section entitled "How are cpusets implemented?"  And look at
sched_domains.txt or prio_tree.txt.  Granted, other docs just describe
the internal interface, but I don't understand why this document can't
expose some implementation details where they help to explain the
semantics. :-(

> 
> > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  Documentation/vm/memory_policy.txt |  278 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 278 insertions(+)
> > 
> > Index: Linux/Documentation/vm/memory_policy.txt
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ Linux/Documentation/vm/memory_policy.txt	2007-07-27 13:40:45.000000000 -0400
> > @@ -0,0 +1,278 @@
> > +
> > +What is Linux Memory Policy?
> > +
> > +In the Linux kernel, "memory policy" determines from which node the kernel will
> > +allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
> > +supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
> > +The current memory policy support was added to Linux 2.6 around May 2004.  This
> > +document attempts to describe the concepts and APIs of the 2.6 memory policy
> > +support.
> > +
> > +See also Documentation/cpusets.txt which describes a higher level,
> > +administrative mechanism for restricting the set of nodes from which memory
> > +policy may allocate pages.  Also, see "MEMORY POLICIES AND CPUSETS" below.
> > +
> 
> hmm. This may conflate what cpusets and memory policies are. Try
> something like;
> 
> This should not be confused with cpusets (Documentation/cpusets.txt) which
> is an administrative mechanism for restricting the usable nodes memory be
> allocated from by a set of processes. Memory policies are a programming
> interface that a NUMA-aware application can take advantage of. When both
> cpusets and policies are applied to a task, the restrictions of the cpuset
> takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details.

I like it, and will make the change.  Let's see if we get any push-back.
> 
> > +MEMORY POLICY CONCEPTS
> > +
> > +Scope of Memory Policies
> > +
> > +The Linux kernel supports four more or less distinct scopes of memory policy:
> > +
> 
> The sentence is too passive. State with certainity like
> 
> The Linux kernel supports four distinct scopes of memory policy:
> 
> Otherwise when I'm reading it I feel I must check if there are more or
> less than four types of policy.

OK.  I meant they were "more or less distinct".  I guess, really, they
are distinct...

> 
> > +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> > +    is the policy that governs the all page allocations that aren't controlled
> > +    by one of the more specific policy scopes discussed below.
> > +
> 
> It's not stated what this policy means until much later. Forward
> references like that may be confusing so consider adding something like;
> 
> The default policy will allocate from the closest memory node to the currently
> running CPU and fallback to nodes in order of distance.

Well, here I'm trying to describe the "scopes", not the behavior.  I
think that the various policy scopes and their interaction is an
important semantic.  The "system default policy" just happens to
"MPOL_DEFAULT" when the system is up and running.  However, during boot,
it is MPOL_INTERLEAVE.  

> 
> > +    Task/Process Policy:  this is an optional, per-task policy.  When defined
> > +    for a specific task, this policy controls all page allocations made by or
> > +    on behalf of the task that aren't controlled by a more specific scope.
> > +    If a task does not define a task policy, then all page allocations that
> > +    would have been controlled by the task policy "fall back" to the System
> > +    Default Policy.
> > +
> 
> Consider reversing the order you are talking about the policies. If you
> discuss the policies with more restricted scope and finish with the
> default policy, you can avoid future references.

Again, I'm describing scope.  However, I could describe policies before
scope, but then, I'd need to forward reference VMA policy scope when
describing MPOL_DEFAULT, because MPOL_DEFAULT has
context/scope-dependent behavior.  Circular dependency!

> 
> > +	Because task policy applies to the entire address space of a task,
> > +	it is inheritable across both fork() [clone() w/o the CLONE_VM flag]
> > +	and exec*(). 
> 
> Remove "Because" here. The policy is not inherited across fork() because
> it applies to the address space. It's because policy is stored in the
> task_struct and it's not cleared by fork() or exec().
> 
> The use of inheritable here implies that the process must take some
> special action for the child to inherit the policy. Is that the case? If
> not, say inherited instead of inheritable.

I guess what I was trying to say is that it CAN be inherited ["is
inheritable"] because it applies to the entire address space.  And, not
everything in the task struct is inherited by the child, right?   And, I
wanted to emphasize that it is inheritED across exec.  I will try to
reword.

> 
> > Thus, a parent task may establish the task policy for
> > +	a child task exec()'d from an executable image that has no awareness
> > +	of memory policy.
> > +
> > +	In a multi-threaded task, task policies apply only to the thread
> > +	[Linux kernel task] that installs the policy and any threads
> > +	subsequently created by that thread.  Any sibling threads existing
> > +	at the time a new task policy is installed retain their current
> > +	policy.
> > +
> 
> Is it worth mentioning numactl here?

Actually, I tried not to mention numactl by name--just that that APIs
and headers reside in an "out of tree" package.  This is a kernel doc
and I wasn't sure about referencing out of tree "stuff"..  Andi
suggested that I not try to describe the syscalls in any detail [thus my
updates to the man pages], and I removed that.  But, I'll figure out a
way to forward reference the brief API descriptions later in the doc.

> 
> > +	A task policy applies only to pages allocated after the policy is
> > +	installed.  Any pages already faulted in by the task remain where
> > +	they were allocated based on the policy at the time they were
> > +	allocated.
> > +
> > +    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
> > +    virtual adddress space.  A task may define a specific policy for a range
> > +    of its virtual address space.  This VMA policy will govern the allocation
> > +    of pages that back this region of the address space.  Any regions of the
> > +    task's address space that don't have an explicit VMA policy will fall back
> > +    to the task policy, which may itself fall back to the system default policy.
> > +
> > +	VMA policy applies ONLY to anonymous pages.  These include pages
> > +	allocated for anonymous segments, such as the task stack and heap, and
> > +	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
> > +	Anonymous pages copied from private file mappings [files mmap()ed with
> > +	the MAP_PRIVATE flag] also obey VMA policy, if defined.
> > +
> 
> The last sentence is confusing. Does it mean that policies can be
> applied to file mappings but only if they are MAP_PRIVATE and the policy
> only comes into play during COW?

Exactly!  I'll try to reword it.

> 
> > +	VMA policies are shared between all tasks that share a virtual address
> > +	space--a.k.a. threads--independent of when the policy is installed; and
> > +	they are inherited across fork().  However, because VMA policies refer
> > +	to a specific region of a task's address space, and because the address
> > +	space is discarded and recreated on exec*(), VMA policies are NOT
> > +	inheritable across exec().  Thus, only NUMA-aware applications may
> > +	use VMA policies.
> > +
> > +	A task may install a new VMA policy on a sub-range of a previously
> > +	mmap()ed region.  When this happens, Linux splits the existing virtual
> > +	memory area into 2 or 3 VMAs, each with it's own policy.
> > +
> > +	By default, VMA policy applies only to pages allocated after the policy
> > +	is installed.  Any pages already faulted into the VMA range remain where
> > +	they were allocated based on the policy at the time they were
> > +	allocated.  However, since 2.6.16, Linux supports page migration so
> > +	that page contents can be moved to match a newly installed policy.
> > +
> 
> State what system call is needed for the migration to take place.

OK.  I'll forward ref mbind().

> 
> > +    Shared Policy:  This policy applies to "memory objects" mapped shared into
> > +    one or more tasks' distinct address spaces.  Shared policies are applied
> > +    directly to the shared object.  Thus, all tasks that attach to the object
> > +    share the policy, and all pages allocated for the shared object, by any
> > +    task, will obey the shared policy.
> > +
> > +	Currently [2.6.22], only shared memory segments, created by shmget(),
> > +	support shared policy. 
> 
> This appears to contradict the previous paragram. The last paragraph
> would imply that the policy is applied to mappings that are mmaped
> MAP_SHARED where they really only apply to shmem mappings.

Conceptually, shared policies apply to shared "memory objects".
However, the implementation is incomplete--only shmem/shm object
currently support this concept.  [I'd REALLY like to fix this, but am
getting major push back... :-(]  

> 
> > +	When shared policy support was added to Linux,
> > +	the associated data structures were added to shared hugetlbfs segments.
> > +	However, at the time, hugetlbfs did not support allocation at fault
> > +	time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> > +	up" to the shared policy support.  Although hugetlbfs segments now
> > +	support lazy allocation, their support for shared policy has not been
> > +	completed.
> > +
> > +	Although internal to the kernel shared memory segments are really
> > +	files backed by swap space that have been mmap()ed shared into tasks'
> > +	address spaces, regular files mmap()ed shared do NOT support shared
> > +	policy.  Rather, shared page cache pages, including pages backing
> > +	private mappings that have not yet been written by the task, follow
> > +	task policy, if any, else system default policy.
> > +
> > +	The shared policy infrastructure supports different policies on subset
> > +	ranges of the shared object.  However, Linux still splits the VMA of
> > +	the task that installs the policy for each range of distinct policy.
> > +	Thus, different tasks that attach to a shared memory segment can have
> > +	different VMA configurations mapping that one shared object.
> > +
> > +Components of Memory Policies
> > +
> > +    A Linux memory policy is a tuple consisting of a "mode" and an optional set
> > +    of nodes.  The mode determine the behavior of the policy, while the optional
> > +    set of nodes can be viewed as the arguments to the behavior.
> > +
> > +   Internally, memory policies are implemented by a reference counted structure,
> > +   struct mempolicy.  Details of this structure will be discussed in context,
> > +   below.
> > +
> > +	Note:  in some functions AND in the struct mempolicy, the mode is
> > +	called "policy".  However, to avoid confusion with the policy tuple,
> > +	this document will continue to use the term "mode".
> > +
> > +   Linux memory policy supports the following 4 modes:
> > +
> > +	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
> > +	context dependent.
> > +
> > +	    During normal system operation, the system default policy is hard
> > +	    coded to contain the Default mode.  During system boot up, the
> > +	    system default policy is temporarily set to MPOL_INTERLEAVE [see
> > +	    below] to distribute boot time allocations across all nodes in
> > +	    the system, instead of using just the node containing the boot cpu.
> > +
> > +	    In this context, default mode means "local" allocation--that is
> > +	    attempt to allocate the page from the node associated with the cpu
> > +	    where the fault occurs.  If the "local" node has no memory, or the
> > +	    node's memory can be exhausted [no free pages available], local
> > +	    allocation will attempt to allocate pages from "nearby" nodes, using
> > +	    a per node list of nodes--called zonelists--built at boot time, or
> > +	    when nodes or memory are added or removed from the system [memory
> > +	    hotplug].
> > +
> > +	    When a task/process policy or a shared policy contains the Default
> > +	    mode, this also means local allocation, as described above.
> > +
> > +	    In the context of a VMA, Default mode means "fall back to task
> > +	    policy"--which may or may not specify Default mode.  Thus, Default
> > +	    mode can not be counted on to mean local allocation when used
> > +	    on a non-shared region of the address space.  However, see
> > +	    MPOL_PREFERRED below.
> > +
> > +	    The Default mode does not use the optional set of nodes.
> > +
> > +	MPOL_BIND:  This mode specifies that memory must come from the
> > +	set of nodes specified by the policy.  The kernel builds a custom
> > +	zonelist pointed to by the zonelist member of struct mempolicy,
> > +	containing just the nodes specified by the Bind policy.  If the kernel
> 
> Omit the implementation details here. Even now it is being considered to
> have just one zonelist per-node that is filtered based on the allocation
> requirements. For MPOL_BIND, this would involve __alloc_pages() taking a
> nodemask and ignoring nodes not allowed by the mask.
> 
> It's sufficent to say that MPOL_BIND will restrict the process to allocating
> pages within a set of nodes specified by a nodemask because the end result
> from the external observer will be similar.

OK.  But, I don't want to lose the idea that, with the BIND policy,
pages will be allocated first from one of the nodes [lowest #] and then
from the next and so on.  This is important, because I've had colleagues
complain to me that it was broken.  They thought that if they bound a
multithread application to cpus on several nodes and to the same nodes
memories, they would get local allocation with fall back only to the
nodes they specified.  They really wanted cpuset semantics, but these
were not available at the time.

For me, part of the problem is that BIND takes more than one node
without taking distance into account, nor allowing the user to specify
an explicit fallback order.

If the new zonelist filtering will change the behavior vis a vis what
node is selected from those specified with the policy, and what the
fallback order is, then we should update this doc when it changes.  I
can help...

> 
> > +	is unable to allocate a page from the first node in the custom zonelist,
> > +	it moves on to the next, and so forth.  If it is unable to allocate a
> > +	page from any of the nodes in this list, the allocation will fail.
> > +
> > +	    The memory policy APIs do not specify an order in which the nodes
> > +	    will be searched.  However, unlike the per node zonelists mentioned
> > +	    above, the custom zonelist for the Bind policy do not consider the
> > +	    distance between the nodes.  Rather, the lists are built in order
> > +	    of numeric node id.
> > +
> 
> Omit the last part as well because if we were filtering nodes based on a
> mask as described above, then MPOL_BIND would actually behave similar to
> the default policy except that is uses a subset of the available nodes.
> Arguably that is more sensible behaviour for MPOL_BIND than what it does today.

OK.  I'll rework this entire section.  Again, I don't want to lose what
I think are important semantics for a user.  And, maybe by documenting
ugly behavior for all to see, we'll do something about it?

> 
> > +	MPOL_PREFERRED:  This mode specifies that the allocation should be
> > +	attempted from the single node specified in the policy.  If that
> > +	allocation fails, the kernel will search other nodes, exactly as
> > +	it would for a local allocation that started at the preferred node--
> > +	that is, using the per-node zonelists in increasing distance from
> > +	the preferred node.
> > +
> > +	    Internally, the Preferred policy uses a single node--the
> > +	    preferred_node member of struct mempolicy.
> > +
> > +	    If the Preferred policy node is '-1', then at page allocation time,
> > +	    the kernel will use the "local node" as the starting point for the
> > +	    allocation.  This is the way to specify local allocation for a
> > +	    specific range of addresses--i.e. for VMA policies.
> > +
> 
> Again, consider omitting the implementation details here. They don't
> help as such.

OK.  I'll drop the '-1' bit.  I do want to maintain the notion of the
"local" variant of preferred.  This only works because the policy
contains a specific token for the preferred_node.  Not sure how to get
this concept across without mentioning something that smells of
implementation details.

> 
> > +	MPOL_INTERLEAVED:  This mode specifies that page allocations be
> > +	interleaved, on a page granularity, across the nodes specified in
> > +	the policy.  This mode also behaves slightly differently, based on
> > +	the context where it is used:
> > +
> > +	    For allocation of anonymous pages and shared memory pages,
> > +	    Interleave mode indexes the set of nodes specified by the policy
> > +	    using the page offset of the faulting address into the segment
> > +	    [VMA] containing the address modulo the number of nodes specified
> > +	    by the policy.  It then attempts to allocate a page, starting at
> > +	    the selected node, as if the node had been specified by a Preferred
> > +	    policy or had been selected by a local allocation.  That is,
> > +	    allocation will follow the per node zonelist.
> > +
> > +	    For allocation of page cache pages, Interleave mode indexes the set
> > +	    of nodes specified by the policy using a node counter maintained
> > +	    per task.  This counter wraps around to the lowest specified node
> > +	    after it reaches the highest specified node.  This will tend to
> > +	    spread the pages out over the nodes specified by the policy based
> > +	    on the order in which they are allocated, rather than based on any
> > +	    page offset into an address range or file.  During system boot up,
> > +	    the temporary interleaved system default policy works in this
> > +	    mode.
> > +
> 
> Oddly, these implementation details are really useful. Keep this one here
> but it would be great if they were in the manual pages.
> 
> > +MEMORY POLICIES AND CPUSETS
> > +
> > +Memory policies work within cpusets as described above.  For memory policies
> > +that require a node or set of nodes, the nodes are restricted to the set of
> > +nodes whose memories are allowed by the cpuset constraints.  This can be
> > +problematic for 2 reasons:
> > +
> > +1) the memory policy APIs take physical node id's as arguments.  However, the
> > +   memory policy APIs do not provide a way to determine what nodes are valid
> > +   in the context where the application is running.  An application MAY consult
> > +   the cpuset file system [directly or via an out of tree, and not generally
> > +   available, libcpuset API] to obtain this information, but then the
> > +   application must be aware that it is running in a cpuset and use what are
> > +   intended primarily as administrative APIs.
> > +
> > +2) when tasks in two cpusets share access to a memory region, such as shared
> > +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
> > +   MAP_SHARED flags, only nodes whose memories are allowed in both cpusets
> > +   may be used in the policies.  Again, obtaining this information requires
> > +   "stepping outside" the memory policy APIs to use the cpuset information.
> > +   Furthermore, if the cpusets' "allowed memory" sets are disjoint, "local"
> > +   allocation is the only valid policy.
> > +
> 
> Consider moving this section to the end. It reads better to keep the discussion
> in the context of policies for as long as possible. Otherwise it's
> 
> Section 1: policies
> Section 2: policies
> Section 3: policies + cpusets
> Section 4: policies
> 
> > +MEMORY POLICY APIs
> > +
> > +Linux supports 3 system calls for controlling memory policy.  These APIS
> 
> s/APIS/APIs/
> 
> > +always affect only the calling task, the calling task's address space, or
> > +some shared object mapped into the calling task's address space.
> > +
> > +	Note:  the headers that define these APIs and the parameter data types
> > +	for user space applications reside in a package that is not part of
> > +	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
> > +	prefix, are defined in <linux/syscalls.h>; the mode and flag
> > +	definitions are defined in <linux/mempolicy.h>.
> > +
> > +Set [Task] Memory Policy:
> > +
> > +	long set_mempolicy(int mode, const unsigned long *nmask,
> > +					unsigned long maxnode);
> > +
> > +	Set's the calling task's "task/process memory policy" to mode
> > +	specified by the 'mode' argument and the set of nodes defined
> > +	by 'nmask'.  'nmask' points to a bit mask of node ids containing
> > +	at least 'maxnode' ids.
> > +
> > +	See the set_mempolicy(2) man page for more details
> > +
> > +
> > +Get [Task] Memory Policy or Related Information
> > +
> > +	long get_mempolicy(int *mode,
> > +			   const unsigned long *nmask, unsigned long maxnode,
> > +			   void *addr, int flags);
> > +
> > +	Queries the "task/process memory policy" of the calling task, or
> > +	the policy or location of a specified virtual address, depending
> > +	on the 'flags' argument.
> > +
> > +	See the get_mempolicy(2) man page for more details
> > +
> > +
> > +Install VMA/Shared Policy for a Range of Task's Address Space
> > +
> > +	long mbind(void *start, unsigned long len, int mode,
> > +		   const unsigned long *nmask, unsigned long maxnode,
> > +		   unsigned flags);
> > +
> > +	mbind() installs the policy specified by (mode, nmask, maxnodes) as
> > +	a VMA policy for the range of the calling task's address space
> > +	specified by the 'start' and 'len' arguments.  Additional actions
> > +	may be requested via the 'flags' argument.
> > +
> > +	See the mbind(2) man page for more details.
> > 
> 
> Despite the comments, this is good work and really useful. I'd be fairly
> happy with it even without further revisions. Thanks a lot for the read.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-31 16:34                       ` Lee Schermerhorn
@ 2007-07-31 19:10                         ` Christoph Lameter
  2007-07-31 19:46                           ` Lee Schermerhorn
  2007-07-31 20:48                         ` [PATCH] Document Linux Memory Policy - V3 Lee Schermerhorn
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-31 19:10 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On Tue, 31 Jul 2007, Lee Schermerhorn wrote:

> > Is it worth mentioning numactl here?
> 
> Actually, I tried not to mention numactl by name--just that that APIs
> and headers reside in an "out of tree" package.  This is a kernel doc
> and I wasn't sure about referencing out of tree "stuff"..  Andi
> suggested that I not try to describe the syscalls in any detail [thus my
> updates to the man pages], and I removed that.  But, I'll figure out a
> way to forward reference the brief API descriptions later in the doc.

numactl definitely must be mentioned because it is the user space API for 
these things.

> > This appears to contradict the previous paragram. The last paragraph
> > would imply that the policy is applied to mappings that are mmaped
> > MAP_SHARED where they really only apply to shmem mappings.
> 
> Conceptually, shared policies apply to shared "memory objects".
> However, the implementation is incomplete--only shmem/shm object
> currently support this concept.  [I'd REALLY like to fix this, but am
> getting major push back... :-(]  

The shmem implementation has bad semantics (affects other processes 
that are unaware of another process redirecting its memory accesses) and 
should not be extended to other types of object.

> > It's sufficent to say that MPOL_BIND will restrict the process to allocating
> > pages within a set of nodes specified by a nodemask because the end result
> > from the external observer will be similar.
> 
> OK.  But, I don't want to lose the idea that, with the BIND policy,
> pages will be allocated first from one of the nodes [lowest #] and then
> from the next and so on.  This is important, because I've had colleagues
> complain to me that it was broken.  They thought that if they bound a
> multithread application to cpus on several nodes and to the same nodes
> memories, they would get local allocation with fall back only to the
> nodes they specified.  They really wanted cpuset semantics, but these
> were not available at the time.

Right. That is something that would be fixed if we could pass a nodemask 
to alloc_pages.

> OK.  I'll rework this entire section.  Again, I don't want to lose what
> I think are important semantics for a user.  And, maybe by documenting
> ugly behavior for all to see, we'll do something about it?

Correct. I hope you include the ugly shared shmem semantics with the 
effect on unsuspecting processes?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-31 19:10                         ` Christoph Lameter
@ 2007-07-31 19:46                           ` Lee Schermerhorn
  2007-07-31 19:58                             ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-31 19:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On Tue, 2007-07-31 at 12:10 -0700, Christoph Lameter wrote:
> On Tue, 31 Jul 2007, Lee Schermerhorn wrote:
> 
> > > Is it worth mentioning numactl here?
> > 
> > Actually, I tried not to mention numactl by name--just that that APIs
> > and headers reside in an "out of tree" package.  This is a kernel doc
> > and I wasn't sure about referencing out of tree "stuff"..  Andi
> > suggested that I not try to describe the syscalls in any detail [thus my
> > updates to the man pages], and I removed that.  But, I'll figure out a
> > way to forward reference the brief API descriptions later in the doc.
> 
> numactl definitely must be mentioned because it is the user space API for 
> these things.

OK.  I'll mention it, but won't go into any detail as this is a kernel
tree doc.

> 
> > > This appears to contradict the previous paragram. The last paragraph
> > > would imply that the policy is applied to mappings that are mmaped
> > > MAP_SHARED where they really only apply to shmem mappings.
> > 
> > Conceptually, shared policies apply to shared "memory objects".
> > However, the implementation is incomplete--only shmem/shm object
> > currently support this concept.  [I'd REALLY like to fix this, but am
> > getting major push back... :-(]  
> 
> The shmem implementation has bad semantics (affects other processes 
> that are unaware of another process redirecting its memory accesses) and 
> should not be extended to other types of object.

<heavy sigh>  I won't rise to the bait, Christoph...

> 
> > > It's sufficent to say that MPOL_BIND will restrict the process to allocating
> > > pages within a set of nodes specified by a nodemask because the end result
> > > from the external observer will be similar.
> > 
> > OK.  But, I don't want to lose the idea that, with the BIND policy,
> > pages will be allocated first from one of the nodes [lowest #] and then
> > from the next and so on.  This is important, because I've had colleagues
> > complain to me that it was broken.  They thought that if they bound a
> > multithread application to cpus on several nodes and to the same nodes
> > memories, they would get local allocation with fall back only to the
> > nodes they specified.  They really wanted cpuset semantics, but these
> > were not available at the time.
> 
> Right. That is something that would be fixed if we could pass a nodemask 
> to alloc_pages.

OK.  We can update the doc when/if that happens.

> > OK.  I'll rework this entire section.  Again, I don't want to lose what
> > I think are important semantics for a user.  And, maybe by documenting
> > ugly behavior for all to see, we'll do something about it?
> 
> Correct. I hope you include the ugly shared shmem semantics with the 
> effect on unsuspecting processes?

Again, I refuse to bite...



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-31 19:46                           ` Lee Schermerhorn
@ 2007-07-31 19:58                             ` Christoph Lameter
  2007-07-31 20:23                               ` Lee Schermerhorn
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-07-31 19:58 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On Tue, 31 Jul 2007, Lee Schermerhorn wrote:
> Again, I refuse to bite...

Please include at least the two sides to it in your doc.

There are numerous issues with memory policies and I think we are still 
waiting for a solution that addresses these issues in a consistent way and 
improves the overall cleanness of the implementation.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V2
  2007-07-31 19:58                             ` Christoph Lameter
@ 2007-07-31 20:23                               ` Lee Schermerhorn
  0 siblings, 0 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-31 20:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On Tue, 2007-07-31 at 12:58 -0700, Christoph Lameter wrote:
> On Tue, 31 Jul 2007, Lee Schermerhorn wrote:
> > Again, I refuse to bite...
> 
> Please include at least the two sides to it in your doc.

I'm trying not to make any judgmental statements one way or another in
the document.  I don't think it's appropriate there.  Rather, I'm trying
to describe the current behavior.  How well I'm succeeding at this is
open to debate, I guess.

> 
> There are numerous issues with memory policies and I think we are still 
> waiting for a solution that addresses these issues in a consistent way and 
> improves the overall cleanness of the implementation.

I think we need to get the in incremental steps.  

And, I think phrases like "consistent way" and "improves overall
cleanness" are very subjective.  Exchanges will be more constructive if
folks could stop asserting their own opinions as undisputed fact.  E.g.,
see:

	http://www.generalsemantics.org/about/about-gs2.htm

	especially the "Some Formulations..." section

or:
	http://www.generalsemantics.org/about/13-common.htm



Anyway, I'm about to post V3.  Have at it.

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH] Document Linux Memory Policy - V3
  2007-07-31 16:34                       ` Lee Schermerhorn
  2007-07-31 19:10                         ` Christoph Lameter
@ 2007-07-31 20:48                         ` Lee Schermerhorn
  2007-08-03 13:52                           ` Mel Gorman
  1 sibling, 1 reply; 60+ messages in thread
From: Lee Schermerhorn @ 2007-07-31 20:48 UTC (permalink / raw)
  To: linux-mm
  Cc: MelGorman, Christoph Lameter, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

[PATCH] Document Linux Memory Policy - V3

V3 -> V2:
+ edits and rework suggested by Randy Dunlap, Mel Gorman and Christoph
  Lameter.  N.B., I couldn't make all of the changes exactly as suggested
  and retain what I consider important semantics.  Therefor, I tried to
  capture the spirit of the suggestions as best I could.

V1 -> V2:
+  Uh, I forget the details.  Rework based on suggestions by Andi Kleen
   and Christoph Lameter.  E.g., dropped syscall details and updated
   the man pages, instead.

I couldn't find any memory policy documentation in the Documentation
directory, so here is my attempt to document it.

There's lots more that could be written about the internal design--including
data structures, functions, etc.  However, if you agree that this is better
that the nothing that exists now, perhaps it could be merged.  This will
provide a baseline for updates to document the many policy patches that are
currently being worked.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/memory_policy.txt |  332 +++++++++++++++++++++++++++++++++++++
 1 file changed, 332 insertions(+)

Index: Linux/Documentation/vm/memory_policy.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ Linux/Documentation/vm/memory_policy.txt	2007-07-31 15:54:50.000000000 -0400
@@ -0,0 +1,332 @@
+
+What is Linux Memory Policy?
+
+In the Linux kernel, "memory policy" determines from which node the kernel will
+allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
+supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
+The current memory policy support was added to Linux 2.6 around May 2004.  This
+document attempts to describe the concepts and APIs of the 2.6 memory policy
+support.
+
+Memory policies should not be confused with cpusets (Documentation/cpusets.txt)
+which is an administrative mechanism for restricting the nodes from which
+memory may be allocated by a set of processes. Memory policies are a
+programming interface that a NUMA-aware application can take advantage of.  When
+both cpusets and policies are applied to a task, the restrictions of the cpuset
+takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more details.
+
+MEMORY POLICY CONCEPTS
+
+Scope of Memory Policies
+
+The Linux kernel supports _scopes_ of memory policy, described here from
+most general to most specific:
+
+    System Default Policy:  this policy is "hard coded" into the kernel.  It
+    is the policy that governs all page allocations that aren't controlled
+    by one of the more specific policy scopes discussed below.  When the
+    system is "up and running", the system default policy will use "local
+    allocation" described below.  However, during boot up, the system
+    default policy will be set to interleave allocations across all nodes
+    with "sufficient" memory, so as not to overload the initial boot node
+    with boot-time allocations.
+
+    Task/Process Policy:  this is an optional, per-task policy.  When defined
+    for a specific task, this policy controls all page allocations made by or
+    on behalf of the task that aren't controlled by a more specific scope.
+    If a task does not define a task policy, then all page allocations that
+    would have been controlled by the task policy "fall back" to the System
+    Default Policy.
+
+	The task policy applies to the entire address space of a task. Thus,
+	it is inheritable, and indeed is inherited, across both fork()
+	[clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
+	to establish the task policy for a child task exec()'d from an
+	executable image that has no awareness of memory policy.  See the
+	MEMORY POLICY APIS section, below, for an overview of the system call
+	that a task may use to set/change it's task/process policy.
+
+	In a multi-threaded task, task policies apply only to the thread
+	[Linux kernel task] that installs the policy and any threads
+	subsequently created by that thread.  Any sibling threads existing
+	at the time a new task policy is installed retain their current
+	policy.
+
+	A task policy applies only to pages allocated after the policy is
+	installed.  Any pages already faulted in by the task when the task
+	changes its task policy remain where they were allocated based on
+	the policy at the time they were allocated.
+
+    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
+    virtual adddress space.  A task may define a specific policy for a range
+    of its virtual address space.   See the MEMORY POLICIES APIS section,
+    below, for an overview of the mbind() system call used to set a VMA
+    policy.
+
+    A VMA policy will govern the allocation of pages that back this region of
+    the address space.  Any regions of the task's address space that don't
+    have an explicit VMA policy will fall back to the task policy, which may
+    itself fall back to the System Default Policy.
+
+    VMA policies have a few complicating details:
+
+	VMA policy applies ONLY to anonymous pages.  These include pages
+	allocated for anonymous segments, such as the task stack and heap, and
+	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
+	If a VMA policy is applied to a file mapping, it will be ignored if
+	the mapping used the MAP_SHARED flag.  If the file mapping used the
+	MAP_PRIVATE flag, the VMA policy will only be applied when an
+	anonymous page is allocated on an attempt to write to the mapping--
+	i.e., at Copy-On-Write.
+
+	VMA policies are shared between all tasks that share a virtual address
+	space--a.k.a. threads--independent of when the policy is installed; and
+	they are inherited across fork().  However, because VMA policies refer
+	to a specific region of a task's address space, and because the address
+	space is discarded and recreated on exec*(), VMA policies are NOT
+	inheritable across exec().  Thus, only NUMA-aware applications may
+	use VMA policies.
+
+	A task may install a new VMA policy on a sub-range of a previously
+	mmap()ed region.  When this happens, Linux splits the existing virtual
+	memory area into 2 or 3 VMAs, each with it's own policy.
+
+	By default, VMA policy applies only to pages allocated after the policy
+	is installed.  Any pages already faulted into the VMA range remain
+	where they were allocated based on the policy at the time they were
+	allocated.  However, since 2.6.16, Linux supports page migration via
+	the mbind() system call, so that page contents can be moved to match
+	a newly installed policy.
+
+    Shared Policy:  Conceptually, shared policies apply to "memory objects"
+    mapped shared into one or more tasks' distinct address spaces.  An
+    application installs a shared policies the same way as VMA policies--using
+    the mbind() system call specifying a range of virtual addresses that map
+    the shared object.  However, unlike VMA policies, which can be considered
+    to be an attribute of a range of a task's address space, shared policies
+    apply directly to the shared object.  Thus, all tasks that attach to the
+    object share the policy, and all pages allocated for the shared object,
+    by any task, will obey the shared policy.
+
+	As of 2.6.22, only shared memory segments, created by shmget() or
+	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
+	policy support was added to Linux, the associated data structures were
+	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
+	support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
+	shmem segments were never "hooked up" to the shared policy support.
+	Although hugetlbfs segments now support lazy allocation, their support
+	for shared policy has not been completed.
+
+	As mentioned above [re: VMA policies], allocations of page cache
+	pages for regular files mmap()ed with MAP_SHARED ignore any VMA
+	policy installed on the virtual address range backed by the shared
+	file mapping.  Rather, shared page cache pages, including pages backing
+	private mappings that have not yet been written by the task, follow
+	task policy, if any, else System Default Policy.
+
+	The shared policy infrastructure supports different policies on subset
+	ranges of the shared object.  However, Linux still splits the VMA of
+	the task that installs the policy for each range of distinct policy.
+	Thus, different tasks that attach to a shared memory segment can have
+	different VMA configurations mapping that one shared object.  This
+	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
+	a shared memory region, when one task has installed shared policy on
+	one or more ranges of the region.
+
+Components of Memory Policies
+
+    A Linux memory policy is a tuple consisting of a "mode" and an optional set
+    of nodes.  The mode determine the behavior of the policy, while the
+    optional set of nodes can be viewed as the arguments to the behavior.
+
+   Internally, memory policies are implemented by a reference counted
+   structure, struct mempolicy.  Details of this structure will be discussed
+   in context, below, as required to explain the behavior.
+
+	Note:  in some functions AND in the struct mempolicy itself, the mode
+	is called "policy".  However, to avoid confusion with the policy tuple,
+	this document will continue to use the term "mode".
+
+   Linux memory policy supports the following 4 behavioral modes:
+
+	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
+	context or scope dependent.
+
+	    As mentioned in the Policy Scope section above, during normal
+	    system operation, the System Default Policy is hard coded to
+	    contain the Default mode.
+
+	    In this context, default mode means "local" allocation--that is
+	    attempt to allocate the page from the node associated with the cpu
+	    where the fault occurs.  If the "local" node has no memory, or the
+	    node's memory can be exhausted [no free pages available], local
+	    allocation will "fallback to"--attempt to allocate pages from--
+	    "nearby" nodes, in order of increasing "distance".
+
+		Implementation detail -- subject to change:  "Fallback" uses
+		a per node list of sibling nodes--called zonelists--built at
+		boot time, or when nodes or memory are added or removed from
+		the system [memory hotplug].  These per node zonelist are
+		constructed with nodes in order of increasing distance based
+		on information provided by the platform firmware.
+
+	    When a task/process policy or a shared policy contains the Default
+	    mode, this also means "local allocation", as described above.
+
+	    In the context of a VMA, Default mode means "fall back to task
+	    policy"--which may or may not specify Default mode.  Thus, Default
+	    mode can not be counted on to mean local allocation when used
+	    on a non-shared region of the address space.  However, see
+	    MPOL_PREFERRED below.
+
+	    The Default mode does not use the optional set of nodes.
+
+	MPOL_BIND:  This mode specifies that memory must come from the
+	set of nodes specified by the policy.
+
+	    The memory policy APIs do not specify an order in which the nodes
+	    will be searched.  However, unlike "local allocation", the Bind
+	    policy does not consider the distance between the nodes.  Rather,
+	    allocations will fallback to the nodes specified by the policy in
+	    order of numeric node id.  Like everything in Linux, this is subject
+	    to change.
+
+	MPOL_PREFERRED:  This mode specifies that the allocation should be
+	attempted from the single node specified in the policy.  If that
+	allocation fails, the kernel will search other nodes, exactly as
+	it would for a local allocation that started at the preferred node
+	in increasing distance from the preferred node.  "Local" allocation
+	policy can be viewed as a Preferred policy that starts at the node
+	containing the cpu where the allocation takes place.
+
+	    Internally, the Preferred policy uses a single node--the
+	    preferred_node member of struct mempolicy.  A "distinguished
+	    value of this preferred_node, currently '-1', is interpreted
+	    as "the node containing the cpu where the allocation takes
+	    place"--local allocation.  This is the way to specify
+	    local allocation for a specific range of addresses--i.e. for
+	    VMA policies.
+
+	MPOL_INTERLEAVED:  This mode specifies that page allocations be
+	interleaved, on a page granularity, across the nodes specified in
+	the policy.  This mode also behaves slightly differently, based on
+	the context where it is used:
+
+	    For allocation of anonymous pages and shared memory pages,
+	    Interleave mode indexes the set of nodes specified by the policy
+	    using the page offset of the faulting address into the segment
+	    [VMA] containing the address modulo the number of nodes specified
+	    by the policy.  It then attempts to allocate a page, starting at
+	    the selected node, as if the node had been specified by a Preferred
+	    policy or had been selected by a local allocation.  That is,
+	    allocation will follow the per node zonelist.
+
+	    For allocation of page cache pages, Interleave mode indexes the set
+	    of nodes specified by the policy using a node counter maintained
+	    per task.  This counter wraps around to the lowest specified node
+	    after it reaches the highest specified node.  This will tend to
+	    spread the pages out over the nodes specified by the policy based
+	    on the order in which they are allocated, rather than based on any
+	    page offset into an address range or file.  During system boot up,
+	    the temporary interleaved system default policy works in this
+	    mode.
+
+MEMORY POLICY APIs
+
+Linux supports 3 system calls for controlling memory policy.  These APIS
+always affect only the calling task, the calling task's address space, or
+some shared object mapped into the calling task's address space.
+
+	Note:  the headers that define these APIs and the parameter data types
+	for user space applications reside in a package that is not part of
+	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
+	prefix, are defined in <linux/syscalls.h>; the mode and flag
+	definitions are defined in <linux/mempolicy.h>.
+
+Set [Task] Memory Policy:
+
+	long set_mempolicy(int mode, const unsigned long *nmask,
+					unsigned long maxnode);
+
+	Set's the calling task's "task/process memory policy" to mode
+	specified by the 'mode' argument and the set of nodes defined
+	by 'nmask'.  'nmask' points to a bit mask of node ids containing
+	at least 'maxnode' ids.
+
+	See the set_mempolicy(2) man page for more details
+
+
+Get [Task] Memory Policy or Related Information
+
+	long get_mempolicy(int *mode,
+			   const unsigned long *nmask, unsigned long maxnode,
+			   void *addr, int flags);
+
+	Queries the "task/process memory policy" of the calling task, or
+	the policy or location of a specified virtual address, depending
+	on the 'flags' argument.
+
+	See the get_mempolicy(2) man page for more details
+
+
+Install VMA/Shared Policy for a Range of Task's Address Space
+
+	long mbind(void *start, unsigned long len, int mode,
+		   const unsigned long *nmask, unsigned long maxnode,
+		   unsigned flags);
+
+	mbind() installs the policy specified by (mode, nmask, maxnodes) as
+	a VMA policy for the range of the calling task's address space
+	specified by the 'start' and 'len' arguments.  Additional actions
+	may be requested via the 'flags' argument.
+
+	See the mbind(2) man page for more details.
+
+MEMORY POLICY COMMAND LINE INTERFACE
+
+Although not strictly part of the Linux implementation of memory policy,
+a command line tool, numactl(8), exists that allows one to:
+
++ set the task policy for a specified program via set_mempolicy(2), fork(2) and
+  exec(2)
+
++ set the shared policy for a shared memory segment via mbind(2)
+
+The numactl(8) tool is packages with the run-time version of the library
+containing the memory policy system call wrappers.  Some distributions
+package the headers and compile-time libraries in a separate development
+package.
+
+
+MEMORY POLICIES AND CPUSETS
+
+Memory policies work within cpusets as described above.  For memory policies
+that require a node or set of nodes, the nodes are restricted to the set of
+nodes whose memories are allowed by the cpuset constraints.  If the
+intersection of the set of nodes specified for the policy and the set of nodes
+allowed by the cpuset is the empty set, the policy is considered invalid and
+cannot be installed.
+
+The interaction of memory policies and cpusets can be problematic for a
+couple of reasons:
+
+1) the memory policy APIs take physical node id's as arguments.  However, the
+   memory policy APIs do not provide a way to determine what nodes are valid
+   in the context where the application is running.  An application MAY consult
+   the cpuset file system [directly or via an out of tree, and not generally
+   available, libcpuset API] to obtain this information, but then the
+   application must be aware that it is running in a cpuset and use what are
+   intended primarily as administrative APIs.
+
+   However, as long as the policy specifies at least one node that is valid
+   in the controlling cpuset, the policy can be used.
+
+2) when tasks in two cpusets share access to a memory region, such as shared
+   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
+   MAP_SHARED flags, and any of the tasks install shared policy on the region,
+   only nodes whose memories are allowed in both cpusets may be used in the
+   policies.  Again, obtaining this information requires "stepping outside"
+   the memory policy APIs, as well as knowing in what cpusets other task might
+   be attaching to the shared region, to use the cpuset information.
+   Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
+   allocation is the only valid policy.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-26 22:59         ` Mel Gorman
  2007-07-27  1:22           ` Christoph Lameter
  2007-07-27 14:24           ` Lee Schermerhorn
@ 2007-08-01 18:59           ` Lee Schermerhorn
  2007-08-02  0:36             ` KAMEZAWA Hiroyuki
  2007-08-02 17:10             ` Mel Gorman
  2 siblings, 2 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-08-01 18:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj

<snip>
> This patch filters only when MPOL_BIND is in use. In non-numa, the
> checks do not exist and in NUMA cases, the filtering usually does not
> take place. I'd like this to be the bug fix for policy + ZONE_MOVABLE
> and then deal with reducing zonelists to see if there is any performance
> gain as well as a simplification in how policies and cpusets are
> implemented.
> 
> Testing shows no difference on non-numa as you'd expect and on NUMA machines,
> there are very small differences on NUMA (kernbench figures range from -0.02%
> to 0.15% differences on machines). Lee, can you test this patch in relation
> to MPOL_BIND?  I'll look at the numactl tests tomorrow as well.
> 

The patches look OK to me.  I got around to testing it today. 
Both atop the Memoryless Nodes series, and directly on 23-rc1-mm1.

Test System: 32GB 4-node ia64, booted with kernelcore=24G.
Yields, about 2GB Movable, and 6G Normal per node.

Filtered zoneinfo:

Node 0, zone   Normal
  pages free     416464
        spanned  425984
        present  424528
Node 0, zone  Movable
  pages free     47195
        spanned  60416
        present  60210
Node 1, zone   Normal
  pages free     388011
        spanned  393216
        present  391871
Node 1, zone  Movable
  pages free     125940
        spanned  126976
        present  126542
Node 2, zone   Normal
  pages free     387849
        spanned  393216
        present  391872
Node 2, zone  Movable
  pages free     126285
        spanned  126976
        present  126542
Node 3, zone   Normal
  pages free     388256
        spanned  393216
        present  391872
Node 3, zone  Movable
  pages free     126575
        spanned  126966
        present  126490
Node 4, zone      DMA
  pages free     31689
        spanned  32767
        present  32656
---
Attempt to allocate a 12G--i.e., > 4*2G--segment interleaved
across nodes 0-3 with memtoy.   I figured this would use up
all of ZONE_MOVABLE on each node and then dip into NORMAL.

root@gwydyr(root):memtoy
memtoy pid:  6558
memtoy>anon a1 12g
memtoy>map a1
memtoy>mbind a1 interleave 0,1,2,3
memtoy>touch a1 w
memtoy:  touched 786432 pages in 10.542 secs

Yields:

Node 0, zone   Normal
  pages free     328392
        spanned  425984
        present  424528
Node 0, zone  Movable
  pages free     37
        spanned  60416
        present  60210
Node 1, zone   Normal
  pages free     300293
        spanned  393216
        present  391871
Node 1, zone  Movable
  pages free     91
        spanned  126976
        present  126542
Node 2, zone   Normal
  pages free     300193
        spanned  393216
        present  391872
Node 2, zone  Movable
  pages free     49
        spanned  126976
        present  126542
Node 3, zone   Normal
  pages free     300448
        spanned  393216
        present  391872
Node 3, zone  Movable
  pages free     56
        spanned  126966
        present  126490
Node 4, zone      DMA
  pages free     31689
        spanned  32767
        present  32656

Looks like most of the movable zone in each node [~8G]
and remainder from normal zones.  Should be ~1G from 
zone normal of each node.  However, memtoy shows something
weird, looking at the location of the 1st 64 pages at each
1G boundary.  Most pages are located as I "expect" [well, I'm
not sure why we start with node 2 at offset 0, instead of 
node 0].

memtoy>where a1
a 0x2000000003c08000 0x000300000000 0x000000000000  rw- private a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
           0:    2   3   0   1   2   3   0   1
           8:    2   3   0   1   2   3   0   1
          10:    2   3   0   1   2   3   0   1
          18:    2   3   0   1   2   3   0   1
          20:    2   3   0   1   2   3   0   1
          28:    2   3   0   1   2   3   0   1
          30:    2   3   0   1   2   3   0   1
          38:    2   3   0   1   2   3   0   1

Same at 1G, 2G and 3G
But, between ~4G through 6+G [I didn't check any finer
granuality and didn't want to watch > 780K pages scroll
by] show:

memtoy>where a1 4g 64p
a 0x2000000003c08000 0x000300000000 0x000000000000  rw- private a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
       40000:    2   3   1   1   2   3   1   1
       40008:    2   3   1   1   2   3   1   1
       40010:    2   3   1   1   2   3   1   1
       40018:    2   3   1   1   2   3   1   1
       40020:    2   3   1   1   2   3   1   1
       40028:    2   3   1   1   2   3   1   1
       40030:    2   3   1   1   2   3   1   1
       40038:    2   3   1   1   2   3   1   1

Same at 5G, then:

memtoy>where a1 6g 64p
a 0x2000000003c08000 0x000300000000 0x000000000000  rw- private a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
       60000:    2   3   2   2   2   3   2   2
       60008:    2   3   2   2   2   3   2   2
       60010:    2   3   2   2   2   3   2   2
       60018:    2   3   2   2   2   3   2   2
       60020:    2   3   2   2   2   3   2   2
       60028:    2   3   2   2   2   3   2   2
       60030:    2   3   2   2   2   3   2   2
       60038:    2   3   2   2   2   3   2   2

7G, 8G, ... 11G back to expected pattern.

Thought this might be due to interaction with memoryless node patches, 
so I backed those out and tested Mel's patch again.  This time I
ran memtoy in batch mode and dumped the entire segment page locations
to a file.  Did this twice.   Both looked pretty much the same--i.e.,
the change in pattern occurs at around the same offset into the
segment.  Note that here, the interleave starts at node 3 at offset
zero.

memtoy>where a1 0 0
a 0x200000000047c000 0x000300000000 0x000000000000  rw- private a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
           0:    3   0   1   2   3   0   1   2
           8:    3   0   1   2   3   0   1   2
          10:    3   0   1   2   3   0   1   2
...
       38c20:    3   0   1   2   3   0   1   2
       38c28:    3   0   1   2   3   0   1   2
       38c30:    3   1   1   2   3   1   1   2
       38c38:    3   1   1   2   3   1   1   2
       38c40:    3   1   1   2   3   1   1   2
...
       5a0c0:    3   1   1   2   3   1   1   2
       5a0c8:    3   1   1   2   3   1   1   2
       5a0d0:    3   1   1   2   3   2   2   2
       5a0d8:    3   2   2   2   3   2   2   2
       5a0e0:    3   2   2   2   3   2   2   2
...
       65230:    3   2   2   2   3   2   2   2
       65238:    3   2   2   2   3   2   2   2
       65240:    3   2   2   2   3   3   3   3
       65248:    3   3   3   3   3   3   3   3
       65250:    3   3   3   3   3   3   3   3
...
       6ab60:    3   3   3   3   3   3   3   3
       6ab68:    3   3   3   3   3   3   3   3
       6ab70:    3   3   3   2   3   0   1   2
       6ab78:    3   0   1   2   3   0   1   2
       6ab80:    3   0   1   2   3   0   1   2
...
and so on to the end of the segment:
       bffe8:    3   0   1   2   3   0   1   2
       bfff0:    3   0   1   2   3   0   1   2
       bfff8:    3   0   1   2   3   0   1   2

The pattern changes occur at about page offsets:

0x38800 = ~ 3.6G
0x5a000 = ~ 5.8G
0x65000 = ~ 6.4G
0x6aa00 = ~ 6.8G

Then I checked zonelist order:
Built 5 zonelists in Zone order, mobility grouping on.  Total pages: 2072583

Looks like we're falling back to ZONE_MOVABLE on the next node when ZONE_MOVABLE
on target node overflows.

Rebooted to "Node order" [numa_zonelist_order sysctl missing in 23-rc1-mm1]
and tried again.  Saw "expected" interleave pattern across entire 12G segment.

Kame-san's patch to just exclude the DMA zones from the zonelists is looking
better--better than changing zonelist order when zone_movable is populated!

But, Mel's patch seems to work OK.  I'll keep it in my stack for later 
stress testing.

Lee


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-01 18:59           ` Lee Schermerhorn
@ 2007-08-02  0:36             ` KAMEZAWA Hiroyuki
  2007-08-02 17:10             ` Mel Gorman
  1 sibling, 0 replies; 60+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-08-02  0:36 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Mel Gorman, Christoph Lameter, linux-mm, ak, akpm, pj

On Wed, 01 Aug 2007 14:59:39 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Kame-san's patch to just exclude the DMA zones from the zonelists is looking
> better--better than changing zonelist order when zone_movable is populated!
> 

I'm now considering setting "lowmem_reserve_ratio" to appropriate value can
help node-order case. (Many cutomer uses the default (in RHEL4 = 0) and saw
troubles.). Is it not enough ?

Thanks,
-Kame.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-07-25 19:31   ` Christoph Lameter
  2007-07-26  4:15     ` KAMEZAWA Hiroyuki
  2007-07-26 13:23     ` Mel Gorman
@ 2007-08-02 14:09     ` Mel Gorman
  2007-08-02 18:56       ` Christoph Lameter
  2 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-08-02 14:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (25/07/07 12:31), Christoph Lameter didst pronounce:

> > Here is the patch just to handle policies with ZONE_MOVABLE. The highest
> > zone still gets treated as it does today but allocations using ZONE_MOVABLE
> > will still be policied. It has been boot-tested and a basic compile job run
> > on a x86_64 NUMA machine (elm3b6 on test.kernel.org). Is there a
> > standard test for regression testing policies?
> 
> There is a test in the numactl package by Andi Kleen.
> 

This was a whole pile of fun. I tried to use the regression test from numactl
0.9.10 and found it failed on a number of kernels - 2.6.23-rc1, 2.6.22,
2.6.21, 2.6.20 etc with an x86_64. Was this known or did it just work for
other people? Whether this test is buggy or not is a matter of definition.

The regression tests depend on reading a numastat file from /sys before and
after running a program that consumes memory called memhog. The tests both
numactl and the numa APIs. The values in numastat are checked before and
after memhog runs to make sure the values are as expected.

This is all great and grand until you realise those counters are not guaranteed
to be up-to-date. They are per-cpu variables were are refreshed every second
by default. This means when the regression test reads them immediately after
memhog exits, it may read a stale value and "fail". If it had waited a few
seconds and tried again, it would have got the right value and passed.

Hence the regression test is dependant on timing. The question is if the values
should always be up-to-date when read from userspace. I put together one patch
that would refresh the counters when numastat or vmstat was being read but it
requires a per-cpu function to be called. This may be undesirable as it would
be punishing on large systems running tools that frequently read /proc/vmstat
for example. Was it done this way on purpose? The comments around the stats
code would led me to believe this lag is on purpose to avoid per-cpu calls.

The alternative was to apply this patch to numactl so that the
regression test waits on the timers to update. With this patch, the
regression tests passed on a 4-node x86_64 machine.

Signed-off-by: Mel Gorman <mel.csn.ul.ie>

---
 regress |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -ru numactl-0.9.10-orig/test/regress numactl-0.9.10/test/regress
--- numactl-0.9.10-orig/test/regress	2007-08-01 19:56:07.000000000 +0100
+++ numactl-0.9.10/test/regress	2007-08-02 14:49:16.000000000 +0100
@@ -7,11 +7,18 @@
 SIZE=$[30 * $MB]
 DEMOSIZE=$[10 * $MB]
 VALGRIND=${VALGRIND:-}
+STAT_INTERVAL=5
 
 numactl() { 
 	$VALGRIND ../numactl "$@"
 }
 
+# Get the interval vm statistics refresh at
+if [ -e /proc/sys/vm/stat_interval ]; then
+	STAT_INTERVAL=`cat /proc/sys/vm/stat_interval`
+	STAT_INTERVAL=`expr $STAT_INTERVAL \* 2`
+fi
+
 BASE=`pwd`/..
 export LD_LIBRARY_PATH=$BASE
 export PATH=$BASE:$PATH
@@ -40,6 +47,7 @@
 
 # args: statname node
 nstat() { 
+    sleep $STAT_INTERVAL
     declare -a fields
     numastat | grep $1 | while read -a fields ; do	
 	echo ${fields[$[1 + $2]]}
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-01 18:59           ` Lee Schermerhorn
  2007-08-02  0:36             ` KAMEZAWA Hiroyuki
@ 2007-08-02 17:10             ` Mel Gorman
  2007-08-02 17:51               ` Lee Schermerhorn
  1 sibling, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-08-02 17:10 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj

On (01/08/07 14:59), Lee Schermerhorn didst pronounce:
> <snip>
> > This patch filters only when MPOL_BIND is in use. In non-numa, the
> > checks do not exist and in NUMA cases, the filtering usually does not
> > take place. I'd like this to be the bug fix for policy + ZONE_MOVABLE
> > and then deal with reducing zonelists to see if there is any performance
> > gain as well as a simplification in how policies and cpusets are
> > implemented.
> > 
> > Testing shows no difference on non-numa as you'd expect and on NUMA machines,
> > there are very small differences on NUMA (kernbench figures range from -0.02%
> > to 0.15% differences on machines). Lee, can you test this patch in relation
> > to MPOL_BIND?  I'll look at the numactl tests tomorrow as well.
> > 
> 
> The patches look OK to me.  I got around to testing it today. 
> Both atop the Memoryless Nodes series, and directly on 23-rc1-mm1.
> 

Excellent. Thanks for the test. I hadn't seen memtool in use before, it
looks great for investigating this sort of thing.

> Test System: 32GB 4-node ia64, booted with kernelcore=24G.
> Yields, about 2GB Movable, and 6G Normal per node.
> 
> Filtered zoneinfo:
> 
> Node 0, zone   Normal
>   pages free     416464
>         spanned  425984
>         present  424528
> Node 0, zone  Movable
>   pages free     47195
>         spanned  60416
>         present  60210
> Node 1, zone   Normal
>   pages free     388011
>         spanned  393216
>         present  391871
> Node 1, zone  Movable
>   pages free     125940
>         spanned  126976
>         present  126542
> Node 2, zone   Normal
>   pages free     387849
>         spanned  393216
>         present  391872
> Node 2, zone  Movable
>   pages free     126285
>         spanned  126976
>         present  126542
> Node 3, zone   Normal
>   pages free     388256
>         spanned  393216
>         present  391872
> Node 3, zone  Movable
>   pages free     126575
>         spanned  126966
>         present  126490
> Node 4, zone      DMA
>   pages free     31689
>         spanned  32767
>         present  32656
> ---
> Attempt to allocate a 12G--i.e., > 4*2G--segment interleaved
> across nodes 0-3 with memtoy.   I figured this would use up
> all of ZONE_MOVABLE on each node and then dip into NORMAL.
> 
> root@gwydyr(root):memtoy
> memtoy pid:  6558
> memtoy>anon a1 12g
> memtoy>map a1
> memtoy>mbind a1 interleave 0,1,2,3
> memtoy>touch a1 w
> memtoy:  touched 786432 pages in 10.542 secs
> 
> Yields:
> 
> Node 0, zone   Normal
>   pages free     328392
>         spanned  425984
>         present  424528
> Node 0, zone  Movable
>   pages free     37
>         spanned  60416
>         present  60210
> Node 1, zone   Normal
>   pages free     300293
>         spanned  393216
>         present  391871
> Node 1, zone  Movable
>   pages free     91
>         spanned  126976
>         present  126542
> Node 2, zone   Normal
>   pages free     300193
>         spanned  393216
>         present  391872
> Node 2, zone  Movable
>   pages free     49
>         spanned  126976
>         present  126542
> Node 3, zone   Normal
>   pages free     300448
>         spanned  393216
>         present  391872
> Node 3, zone  Movable
>   pages free     56
>         spanned  126966
>         present  126490
> Node 4, zone      DMA
>   pages free     31689
>         spanned  32767
>         present  32656
> 
> Looks like most of the movable zone in each node [~8G]
> and remainder from normal zones.  Should be ~1G from 
> zone normal of each node.  However, memtoy shows something
> weird, looking at the location of the 1st 64 pages at each
> 1G boundary.  Most pages are located as I "expect" [well, I'm
> not sure why we start with node 2 at offset 0, instead of 
> node 0].

Could it simply because the process started on node 2?  alloc_page_interleave()
would have taken the zonelist on that node then.

> 
> memtoy>where a1
> a 0x2000000003c08000 0x000300000000 0x000000000000  rw- private a1
> page offset    +00 +01 +02 +03 +04 +05 +06 +07
>            0:    2   3   0   1   2   3   0   1
>            8:    2   3   0   1   2   3   0   1
>           10:    2   3   0   1   2   3   0   1
>           18:    2   3   0   1   2   3   0   1
>           20:    2   3   0   1   2   3   0   1
>           28:    2   3   0   1   2   3   0   1
>           30:    2   3   0   1   2   3   0   1
>           38:    2   3   0   1   2   3   0   1
> 
> Same at 1G, 2G and 3G
> But, between ~4G through 6+G [I didn't check any finer
> granuality and didn't want to watch > 780K pages scroll
> by] show:
> 
> memtoy>where a1 4g 64p
> a 0x2000000003c08000 0x000300000000 0x000000000000  rw- private a1
> page offset    +00 +01 +02 +03 +04 +05 +06 +07
>        40000:    2   3   1   1   2   3   1   1
>        40008:    2   3   1   1   2   3   1   1
>        40010:    2   3   1   1   2   3   1   1
>        40018:    2   3   1   1   2   3   1   1
>        40020:    2   3   1   1   2   3   1   1
>        40028:    2   3   1   1   2   3   1   1
>        40030:    2   3   1   1   2   3   1   1
>        40038:    2   3   1   1   2   3   1   1
> 
> Same at 5G, then:
> 
> memtoy>where a1 6g 64p
> a 0x2000000003c08000 0x000300000000 0x000000000000  rw- private a1
> page offset    +00 +01 +02 +03 +04 +05 +06 +07
>        60000:    2   3   2   2   2   3   2   2
>        60008:    2   3   2   2   2   3   2   2
>        60010:    2   3   2   2   2   3   2   2
>        60018:    2   3   2   2   2   3   2   2
>        60020:    2   3   2   2   2   3   2   2
>        60028:    2   3   2   2   2   3   2   2
>        60030:    2   3   2   2   2   3   2   2
>        60038:    2   3   2   2   2   3   2   2
> 
> 7G, 8G, ... 11G back to expected pattern.
> 
> Thought this might be due to interaction with memoryless node patches, 
> so I backed those out and tested Mel's patch again.  This time I
> ran memtoy in batch mode and dumped the entire segment page locations
> to a file.  Did this twice.   Both looked pretty much the same--i.e.,
> the change in pattern occurs at around the same offset into the
> segment.  Note that here, the interleave starts at node 3 at offset
> zero.
> 
> memtoy>where a1 0 0
> a 0x200000000047c000 0x000300000000 0x000000000000  rw- private a1
> page offset    +00 +01 +02 +03 +04 +05 +06 +07
>            0:    3   0   1   2   3   0   1   2
>            8:    3   0   1   2   3   0   1   2
>           10:    3   0   1   2   3   0   1   2
> ...
>        38c20:    3   0   1   2   3   0   1   2
>        38c28:    3   0   1   2   3   0   1   2
>        38c30:    3   1   1   2   3   1   1   2
>        38c38:    3   1   1   2   3   1   1   2
>        38c40:    3   1   1   2   3   1   1   2
> ...
>        5a0c0:    3   1   1   2   3   1   1   2
>        5a0c8:    3   1   1   2   3   1   1   2
>        5a0d0:    3   1   1   2   3   2   2   2
>        5a0d8:    3   2   2   2   3   2   2   2
>        5a0e0:    3   2   2   2   3   2   2   2
> ...
>        65230:    3   2   2   2   3   2   2   2
>        65238:    3   2   2   2   3   2   2   2
>        65240:    3   2   2   2   3   3   3   3
>        65248:    3   3   3   3   3   3   3   3
>        65250:    3   3   3   3   3   3   3   3
> ...
>        6ab60:    3   3   3   3   3   3   3   3
>        6ab68:    3   3   3   3   3   3   3   3
>        6ab70:    3   3   3   2   3   0   1   2
>        6ab78:    3   0   1   2   3   0   1   2
>        6ab80:    3   0   1   2   3   0   1   2
> ...
> and so on to the end of the segment:
>        bffe8:    3   0   1   2   3   0   1   2
>        bfff0:    3   0   1   2   3   0   1   2
>        bfff8:    3   0   1   2   3   0   1   2
> 
> The pattern changes occur at about page offsets:
> 
> 0x38800 = ~ 3.6G
> 0x5a000 = ~ 5.8G
> 0x65000 = ~ 6.4G
> 0x6aa00 = ~ 6.8G
> 
> Then I checked zonelist order:
> Built 5 zonelists in Zone order, mobility grouping on.  Total pages: 2072583
> 
> Looks like we're falling back to ZONE_MOVABLE on the next node when ZONE_MOVABLE
> on target node overflows.
> 

Ok, which might have been unexpected to you, but it's behaving as
advertised for zonelists.

> Rebooted to "Node order" [numa_zonelist_order sysctl missing in 23-rc1-mm1]
> and tried again.  Saw "expected" interleave pattern across entire 12G segment.
> 
> Kame-san's patch to just exclude the DMA zones from the zonelists is looking
> better--better than changing zonelist order when zone_movable is populated!
> 
> But, Mel's patch seems to work OK.  I'll keep it in my stack for later 
> stress testing.
> 

Great. As this has passed your tests and it passes the numactl
regression tests (when patched for timing problems) with and without
kernelcore, I reckon it's good as a bugfix.

Thanks Lee

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-02 17:10             ` Mel Gorman
@ 2007-08-02 17:51               ` Lee Schermerhorn
  0 siblings, 0 replies; 60+ messages in thread
From: Lee Schermerhorn @ 2007-08-02 17:51 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, linux-mm, ak, KAMEZAWA Hiroyuki, akpm, pj

On Thu, 2007-08-02 at 18:10 +0100, Mel Gorman wrote:
> On (01/08/07 14:59), Lee Schermerhorn didst pronounce:
> > <snip>
> > > This patch filters only when MPOL_BIND is in use. In non-numa, the
> > > checks do not exist and in NUMA cases, the filtering usually does not
> > > take place. I'd like this to be the bug fix for policy + ZONE_MOVABLE
> > > and then deal with reducing zonelists to see if there is any performance
> > > gain as well as a simplification in how policies and cpusets are
> > > implemented.
> > > 
> > > Testing shows no difference on non-numa as you'd expect and on NUMA machines,
> > > there are very small differences on NUMA (kernbench figures range from -0.02%
> > > to 0.15% differences on machines). Lee, can you test this patch in relation
> > > to MPOL_BIND?  I'll look at the numactl tests tomorrow as well.
> > > 
> > 
> > The patches look OK to me.  I got around to testing it today. 
> > Both atop the Memoryless Nodes series, and directly on 23-rc1-mm1.
> > 
> 
> Excellent. Thanks for the test. I hadn't seen memtool in use before, it
> looks great for investigating this sort of thing.

You can grab the latest memtoy at:

http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

Be sure to read the README about building.  It depends on headhers and
libraries that may not be on your system.  I also have a number of
compile time options and stub libraries that allow me to test on
non-numa platforms...   Other folks who have tried to compile it have
problems the first time, so I tried to document the issues and how to
resolve.


<snip>
> > 
> > Looks like most of the movable zone in each node [~8G]
> > and remainder from normal zones.  Should be ~1G from 
> > zone normal of each node.  However, memtoy shows something
> > weird, looking at the location of the 1st 64 pages at each
> > 1G boundary.  Most pages are located as I "expect" [well, I'm
> > not sure why we start with node 2 at offset 0, instead of 
> > node 0].
> 
> Could it simply because the process started on node 2?  alloc_page_interleave()
> would have taken the zonelist on that node then.

Except alloc_page_interleave() takes a starting node id that it gets
from interleave_nid()--which should use offset based interleaving.  I'll
instrument this to see what's going on when I get a chance.

<snip>
> > 
> > Then I checked zonelist order:
> > Built 5 zonelists in Zone order, mobility grouping on.  Total pages: 2072583
> > 
> > Looks like we're falling back to ZONE_MOVABLE on the next node when ZONE_MOVABLE
> > on target node overflows.
> > 
> 
> Ok, which might have been unexpected to you, but it's behaving as
> advertised for zonelists.

Not unexpected, once I realized what was happening.  As I replied to
Kame, if I had chosen a more realistic [???] -- i.e., smaller --
kernelcore size, I think it would worked as I first expected.

> 
> > Rebooted to "Node order" [numa_zonelist_order sysctl missing in 23-rc1-mm1]
> > and tried again.  Saw "expected" interleave pattern across entire 12G segment.
> > 
> > Kame-san's patch to just exclude the DMA zones from the zonelists is looking
> > better--better than changing zonelist order when zone_movable is populated!
> > 
> > But, Mel's patch seems to work OK.  I'll keep it in my stack for later 
> > stress testing.
> > 
> 
> Great. As this has passed your tests and it passes the numactl
> regression tests (when patched for timing problems) with and without
> kernelcore, I reckon it's good as a bugfix.
> 
> Thanks Lee
> 

My pleasure.  I learned a lot doing it...

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-02 14:09     ` Mel Gorman
@ 2007-08-02 18:56       ` Christoph Lameter
  2007-08-02 19:42         ` Mel Gorman
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-08-02 18:56 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Thu, 2 Aug 2007, Mel Gorman wrote:

> Hence the regression test is dependant on timing. The question is if the values
> should always be up-to-date when read from userspace. I put together one patch
> that would refresh the counters when numastat or vmstat was being read but it
> requires a per-cpu function to be called. This may be undesirable as it would
> be punishing on large systems running tools that frequently read /proc/vmstat
> for example. Was it done this way on purpose? The comments around the stats
> code would led me to believe this lag is on purpose to avoid per-cpu calls.

The lag was introduced with the vm statistics rework since ZVCs use 
deferred updates. We could call refresh_vm_stats before handing out the 
counters?

> The alternative was to apply this patch to numactl so that the
> regression test waits on the timers to update. With this patch, the
> regression tests passed on a 4-node x86_64 machine.

Another possible solution. Andi: Which solution would you prefer?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-02 18:56       ` Christoph Lameter
@ 2007-08-02 19:42         ` Mel Gorman
  2007-08-02 19:52           ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-08-02 19:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (02/08/07 11:56), Christoph Lameter didst pronounce:
> On Thu, 2 Aug 2007, Mel Gorman wrote:
> 
> > Hence the regression test is dependant on timing. The question is if the values
> > should always be up-to-date when read from userspace. I put together one patch
> > that would refresh the counters when numastat or vmstat was being read but it
> > requires a per-cpu function to be called. This may be undesirable as it would
> > be punishing on large systems running tools that frequently read /proc/vmstat
> > for example. Was it done this way on purpose? The comments around the stats
> > code would led me to believe this lag is on purpose to avoid per-cpu calls.
> 
> The lag was introduced with the vm statistics rework since ZVCs use 
> deferred updates. We could call refresh_vm_stats before handing out the 
> counters?
> 

We could but as I said, this might be a problem for monitor programs because
an IPI call is involved for it to be 100% safe. I've included a patch below
to illustrate what appears to be required for the stats read always to be
up-to-date. Prehaps there is a less expensive way of doing it.

> > The alternative was to apply this patch to numactl so that the
> > regression test waits on the timers to update. With this patch, the
> > regression tests passed on a 4-node x86_64 machine.
> 
> Another possible solution. Andi: Which solution would you prefer?

Option 2 currently looks like;

--- 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cae346e..3656489 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -98,6 +98,7 @@ static SYSDEV_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
 
 static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
 {
+	refresh_all_cpu_vm_stats();
 	return sprintf(buf,
 		       "numa_hit %lu\n"
 		       "numa_miss %lu\n"
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 75370ec..31046e2 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -213,6 +213,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void refresh_all_cpu_vm_stats(void);
 #else /* CONFIG_SMP */
 
 /*
@@ -259,6 +260,7 @@ static inline void __dec_zone_page_state(struct page *page,
 #define mod_zone_page_state __mod_zone_page_state
 
 static inline void refresh_cpu_vm_stats(int cpu) { }
+static inline void refresh_all_cpu_vm_stats(int cpu) { }
 #endif
 
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c64d169..9c75baa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -621,6 +621,24 @@ const struct seq_operations zoneinfo_op = {
 	.show	= zoneinfo_show,
 };
 
+#ifdef CONFIG_SMP
+void __refresh_all_cpu_vm_stats(void *arg)
+{
+	refresh_cpu_vm_stats(smp_processor_id());
+}
+
+void refresh_all_cpu_vm_stats(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	refresh_cpu_vm_stats(smp_processor_id());
+	local_irq_restore(flags);
+
+	smp_call_function(__refresh_all_cpu_vm_stats, NULL, 0, 1);
+}
+#endif /* CONFIG_SMP */
+
 static void *vmstat_start(struct seq_file *m, loff_t *pos)
 {
 	unsigned long *v;
@@ -642,6 +660,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	m->private = v;
 	if (!v)
 		return ERR_PTR(-ENOMEM);
+	refresh_all_cpu_vm_stats();
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		v[i] = global_page_state(i);
 #ifdef CONFIG_VM_EVENT_COUNTERS
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-02 19:42         ` Mel Gorman
@ 2007-08-02 19:52           ` Christoph Lameter
  2007-08-03  9:32             ` Mel Gorman
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2007-08-02 19:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On Thu, 2 Aug 2007, Mel Gorman wrote:

> 
> --- 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cae346e..3656489 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -98,6 +98,7 @@ static SYSDEV_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>  
>  static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
>  {
> +	refresh_all_cpu_vm_stats();

The function is called refresh_vmstats(). Just export it.

>  		       "numa_miss %lu\n"
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 75370ec..31046e2 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -213,6 +213,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
>  extern void __dec_zone_state(struct zone *, enum zone_stat_item);
>  
>  void refresh_cpu_vm_stats(int);
> +void refresh_all_cpu_vm_stats(void);

No need to add another one.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-02 19:52           ` Christoph Lameter
@ 2007-08-03  9:32             ` Mel Gorman
  2007-08-03 16:36               ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: Mel Gorman @ 2007-08-03  9:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

On (02/08/07 12:52), Christoph Lameter didst pronounce:
> On Thu, 2 Aug 2007, Mel Gorman wrote:
> 
> > 
> > --- 
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index cae346e..3656489 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -98,6 +98,7 @@ static SYSDEV_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
> >  
> >  static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
> >  {
> > +	refresh_all_cpu_vm_stats();
> 
> The function is called refresh_vmstats(). Just export it.
> 
> >  		       "numa_miss %lu\n"
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 75370ec..31046e2 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -213,6 +213,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
> >  extern void __dec_zone_state(struct zone *, enum zone_stat_item);
> >  
> >  void refresh_cpu_vm_stats(int);
> > +void refresh_all_cpu_vm_stats(void);
> 
> No need to add another one.
> 

diff --git a/drivers/base/node.c b/drivers/base/node.c
index cae346e..5a7f898 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -98,6 +98,7 @@ static SYSDEV_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
 
 static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
 {
+	refresh_vm_stats();
 	return sprintf(buf,
 		       "numa_hit %lu\n"
 		       "numa_miss %lu\n"
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 75370ec..c9f6dad 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -261,4 +261,6 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_cpu_vm_stats(int cpu) { }
 #endif
 
+void refresh_vm_stats(void);
+
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c64d169..970fb74 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -642,6 +642,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	m->private = v;
 	if (!v)
 		return ERR_PTR(-ENOMEM);
+	refresh_vm_stats();
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		v[i] = global_page_state(i);
 #ifdef CONFIG_VM_EVENT_COUNTERS
-- 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH] Document Linux Memory Policy - V3
  2007-07-31 20:48                         ` [PATCH] Document Linux Memory Policy - V3 Lee Schermerhorn
@ 2007-08-03 13:52                           ` Mel Gorman
  0 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2007-08-03 13:52 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Christoph Lameter, ak, KAMEZAWA Hiroyuki, akpm, pj,
	Michael Kerrisk, Randy Dunlap, Eric Whitney

On (31/07/07 16:48), Lee Schermerhorn didst pronounce:
> [PATCH] Document Linux Memory Policy - V3
> 
> V3 -> V2:
> + edits and rework suggested by Randy Dunlap, Mel Gorman and Christoph
>   Lameter.  N.B., I couldn't make all of the changes exactly as suggested
>   and retain what I consider important semantics.  Therefor, I tried to
>   capture the spirit of the suggestions as best I could.
> 
> V1 -> V2:
> +  Uh, I forget the details.  Rework based on suggestions by Andi Kleen
>    and Christoph Lameter.  E.g., dropped syscall details and updated
>    the man pages, instead.
> 
> I couldn't find any memory policy documentation in the Documentation
> directory, so here is my attempt to document it.
> 
> There's lots more that could be written about the internal design--including
> data structures, functions, etc.  However, if you agree that this is better
> that the nothing that exists now, perhaps it could be merged.  This will
> provide a baseline for updates to document the many policy patches that are
> currently being worked.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 

I'm happy with that. It gets lots of useful information out in as clear
as manner as you get.

Acked-by: Mel Gorman <mel@csn.ul.ie>

>  Documentation/vm/memory_policy.txt |  332 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 332 insertions(+)
> 
> Index: Linux/Documentation/vm/memory_policy.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ Linux/Documentation/vm/memory_policy.txt	2007-07-31 15:54:50.000000000 -0400
> @@ -0,0 +1,332 @@
> +
> +What is Linux Memory Policy?
> +
> +In the Linux kernel, "memory policy" determines from which node the kernel will
> +allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
> +supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
> +The current memory policy support was added to Linux 2.6 around May 2004.  This
> +document attempts to describe the concepts and APIs of the 2.6 memory policy
> +support.
> +
> +Memory policies should not be confused with cpusets (Documentation/cpusets.txt)
> +which is an administrative mechanism for restricting the nodes from which
> +memory may be allocated by a set of processes. Memory policies are a
> +programming interface that a NUMA-aware application can take advantage of.  When
> +both cpusets and policies are applied to a task, the restrictions of the cpuset
> +takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more details.
> +
> +MEMORY POLICY CONCEPTS
> +
> +Scope of Memory Policies
> +
> +The Linux kernel supports _scopes_ of memory policy, described here from
> +most general to most specific:
> +
> +    System Default Policy:  this policy is "hard coded" into the kernel.  It
> +    is the policy that governs all page allocations that aren't controlled
> +    by one of the more specific policy scopes discussed below.  When the
> +    system is "up and running", the system default policy will use "local
> +    allocation" described below.  However, during boot up, the system
> +    default policy will be set to interleave allocations across all nodes
> +    with "sufficient" memory, so as not to overload the initial boot node
> +    with boot-time allocations.
> +
> +    Task/Process Policy:  this is an optional, per-task policy.  When defined
> +    for a specific task, this policy controls all page allocations made by or
> +    on behalf of the task that aren't controlled by a more specific scope.
> +    If a task does not define a task policy, then all page allocations that
> +    would have been controlled by the task policy "fall back" to the System
> +    Default Policy.
> +
> +	The task policy applies to the entire address space of a task. Thus,
> +	it is inheritable, and indeed is inherited, across both fork()
> +	[clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
> +	to establish the task policy for a child task exec()'d from an
> +	executable image that has no awareness of memory policy.  See the
> +	MEMORY POLICY APIS section, below, for an overview of the system call
> +	that a task may use to set/change it's task/process policy.
> +
> +	In a multi-threaded task, task policies apply only to the thread
> +	[Linux kernel task] that installs the policy and any threads
> +	subsequently created by that thread.  Any sibling threads existing
> +	at the time a new task policy is installed retain their current
> +	policy.
> +
> +	A task policy applies only to pages allocated after the policy is
> +	installed.  Any pages already faulted in by the task when the task
> +	changes its task policy remain where they were allocated based on
> +	the policy at the time they were allocated.
> +
> +    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
> +    virtual adddress space.  A task may define a specific policy for a range
> +    of its virtual address space.   See the MEMORY POLICIES APIS section,
> +    below, for an overview of the mbind() system call used to set a VMA
> +    policy.
> +
> +    A VMA policy will govern the allocation of pages that back this region of
> +    the address space.  Any regions of the task's address space that don't
> +    have an explicit VMA policy will fall back to the task policy, which may
> +    itself fall back to the System Default Policy.
> +
> +    VMA policies have a few complicating details:
> +
> +	VMA policy applies ONLY to anonymous pages.  These include pages
> +	allocated for anonymous segments, such as the task stack and heap, and
> +	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
> +	If a VMA policy is applied to a file mapping, it will be ignored if
> +	the mapping used the MAP_SHARED flag.  If the file mapping used the
> +	MAP_PRIVATE flag, the VMA policy will only be applied when an
> +	anonymous page is allocated on an attempt to write to the mapping--
> +	i.e., at Copy-On-Write.
> +
> +	VMA policies are shared between all tasks that share a virtual address
> +	space--a.k.a. threads--independent of when the policy is installed; and
> +	they are inherited across fork().  However, because VMA policies refer
> +	to a specific region of a task's address space, and because the address
> +	space is discarded and recreated on exec*(), VMA policies are NOT
> +	inheritable across exec().  Thus, only NUMA-aware applications may
> +	use VMA policies.
> +
> +	A task may install a new VMA policy on a sub-range of a previously
> +	mmap()ed region.  When this happens, Linux splits the existing virtual
> +	memory area into 2 or 3 VMAs, each with it's own policy.
> +
> +	By default, VMA policy applies only to pages allocated after the policy
> +	is installed.  Any pages already faulted into the VMA range remain
> +	where they were allocated based on the policy at the time they were
> +	allocated.  However, since 2.6.16, Linux supports page migration via
> +	the mbind() system call, so that page contents can be moved to match
> +	a newly installed policy.
> +
> +    Shared Policy:  Conceptually, shared policies apply to "memory objects"
> +    mapped shared into one or more tasks' distinct address spaces.  An
> +    application installs a shared policies the same way as VMA policies--using
> +    the mbind() system call specifying a range of virtual addresses that map
> +    the shared object.  However, unlike VMA policies, which can be considered
> +    to be an attribute of a range of a task's address space, shared policies
> +    apply directly to the shared object.  Thus, all tasks that attach to the
> +    object share the policy, and all pages allocated for the shared object,
> +    by any task, will obey the shared policy.
> +
> +	As of 2.6.22, only shared memory segments, created by shmget() or
> +	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
> +	policy support was added to Linux, the associated data structures were
> +	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
> +	support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
> +	shmem segments were never "hooked up" to the shared policy support.
> +	Although hugetlbfs segments now support lazy allocation, their support
> +	for shared policy has not been completed.
> +
> +	As mentioned above [re: VMA policies], allocations of page cache
> +	pages for regular files mmap()ed with MAP_SHARED ignore any VMA
> +	policy installed on the virtual address range backed by the shared
> +	file mapping.  Rather, shared page cache pages, including pages backing
> +	private mappings that have not yet been written by the task, follow
> +	task policy, if any, else System Default Policy.
> +
> +	The shared policy infrastructure supports different policies on subset
> +	ranges of the shared object.  However, Linux still splits the VMA of
> +	the task that installs the policy for each range of distinct policy.
> +	Thus, different tasks that attach to a shared memory segment can have
> +	different VMA configurations mapping that one shared object.  This
> +	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
> +	a shared memory region, when one task has installed shared policy on
> +	one or more ranges of the region.
> +
> +Components of Memory Policies
> +
> +    A Linux memory policy is a tuple consisting of a "mode" and an optional set
> +    of nodes.  The mode determine the behavior of the policy, while the
> +    optional set of nodes can be viewed as the arguments to the behavior.
> +
> +   Internally, memory policies are implemented by a reference counted
> +   structure, struct mempolicy.  Details of this structure will be discussed
> +   in context, below, as required to explain the behavior.
> +
> +	Note:  in some functions AND in the struct mempolicy itself, the mode
> +	is called "policy".  However, to avoid confusion with the policy tuple,
> +	this document will continue to use the term "mode".
> +
> +   Linux memory policy supports the following 4 behavioral modes:
> +
> +	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
> +	context or scope dependent.
> +
> +	    As mentioned in the Policy Scope section above, during normal
> +	    system operation, the System Default Policy is hard coded to
> +	    contain the Default mode.
> +
> +	    In this context, default mode means "local" allocation--that is
> +	    attempt to allocate the page from the node associated with the cpu
> +	    where the fault occurs.  If the "local" node has no memory, or the
> +	    node's memory can be exhausted [no free pages available], local
> +	    allocation will "fallback to"--attempt to allocate pages from--
> +	    "nearby" nodes, in order of increasing "distance".
> +
> +		Implementation detail -- subject to change:  "Fallback" uses
> +		a per node list of sibling nodes--called zonelists--built at
> +		boot time, or when nodes or memory are added or removed from
> +		the system [memory hotplug].  These per node zonelist are
> +		constructed with nodes in order of increasing distance based
> +		on information provided by the platform firmware.
> +
> +	    When a task/process policy or a shared policy contains the Default
> +	    mode, this also means "local allocation", as described above.
> +
> +	    In the context of a VMA, Default mode means "fall back to task
> +	    policy"--which may or may not specify Default mode.  Thus, Default
> +	    mode can not be counted on to mean local allocation when used
> +	    on a non-shared region of the address space.  However, see
> +	    MPOL_PREFERRED below.
> +
> +	    The Default mode does not use the optional set of nodes.
> +
> +	MPOL_BIND:  This mode specifies that memory must come from the
> +	set of nodes specified by the policy.
> +
> +	    The memory policy APIs do not specify an order in which the nodes
> +	    will be searched.  However, unlike "local allocation", the Bind
> +	    policy does not consider the distance between the nodes.  Rather,
> +	    allocations will fallback to the nodes specified by the policy in
> +	    order of numeric node id.  Like everything in Linux, this is subject
> +	    to change.
> +
> +	MPOL_PREFERRED:  This mode specifies that the allocation should be
> +	attempted from the single node specified in the policy.  If that
> +	allocation fails, the kernel will search other nodes, exactly as
> +	it would for a local allocation that started at the preferred node
> +	in increasing distance from the preferred node.  "Local" allocation
> +	policy can be viewed as a Preferred policy that starts at the node
> +	containing the cpu where the allocation takes place.
> +
> +	    Internally, the Preferred policy uses a single node--the
> +	    preferred_node member of struct mempolicy.  A "distinguished
> +	    value of this preferred_node, currently '-1', is interpreted
> +	    as "the node containing the cpu where the allocation takes
> +	    place"--local allocation.  This is the way to specify
> +	    local allocation for a specific range of addresses--i.e. for
> +	    VMA policies.
> +
> +	MPOL_INTERLEAVED:  This mode specifies that page allocations be
> +	interleaved, on a page granularity, across the nodes specified in
> +	the policy.  This mode also behaves slightly differently, based on
> +	the context where it is used:
> +
> +	    For allocation of anonymous pages and shared memory pages,
> +	    Interleave mode indexes the set of nodes specified by the policy
> +	    using the page offset of the faulting address into the segment
> +	    [VMA] containing the address modulo the number of nodes specified
> +	    by the policy.  It then attempts to allocate a page, starting at
> +	    the selected node, as if the node had been specified by a Preferred
> +	    policy or had been selected by a local allocation.  That is,
> +	    allocation will follow the per node zonelist.
> +
> +	    For allocation of page cache pages, Interleave mode indexes the set
> +	    of nodes specified by the policy using a node counter maintained
> +	    per task.  This counter wraps around to the lowest specified node
> +	    after it reaches the highest specified node.  This will tend to
> +	    spread the pages out over the nodes specified by the policy based
> +	    on the order in which they are allocated, rather than based on any
> +	    page offset into an address range or file.  During system boot up,
> +	    the temporary interleaved system default policy works in this
> +	    mode.
> +
> +MEMORY POLICY APIs
> +
> +Linux supports 3 system calls for controlling memory policy.  These APIS
> +always affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space.
> +
> +	Note:  the headers that define these APIs and the parameter data types
> +	for user space applications reside in a package that is not part of
> +	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
> +	prefix, are defined in <linux/syscalls.h>; the mode and flag
> +	definitions are defined in <linux/mempolicy.h>.
> +
> +Set [Task] Memory Policy:
> +
> +	long set_mempolicy(int mode, const unsigned long *nmask,
> +					unsigned long maxnode);
> +
> +	Set's the calling task's "task/process memory policy" to mode
> +	specified by the 'mode' argument and the set of nodes defined
> +	by 'nmask'.  'nmask' points to a bit mask of node ids containing
> +	at least 'maxnode' ids.
> +
> +	See the set_mempolicy(2) man page for more details
> +
> +
> +Get [Task] Memory Policy or Related Information
> +
> +	long get_mempolicy(int *mode,
> +			   const unsigned long *nmask, unsigned long maxnode,
> +			   void *addr, int flags);
> +
> +	Queries the "task/process memory policy" of the calling task, or
> +	the policy or location of a specified virtual address, depending
> +	on the 'flags' argument.
> +
> +	See the get_mempolicy(2) man page for more details
> +
> +
> +Install VMA/Shared Policy for a Range of Task's Address Space
> +
> +	long mbind(void *start, unsigned long len, int mode,
> +		   const unsigned long *nmask, unsigned long maxnode,
> +		   unsigned flags);
> +
> +	mbind() installs the policy specified by (mode, nmask, maxnodes) as
> +	a VMA policy for the range of the calling task's address space
> +	specified by the 'start' and 'len' arguments.  Additional actions
> +	may be requested via the 'flags' argument.
> +
> +	See the mbind(2) man page for more details.
> +
> +MEMORY POLICY COMMAND LINE INTERFACE
> +
> +Although not strictly part of the Linux implementation of memory policy,
> +a command line tool, numactl(8), exists that allows one to:
> +
> ++ set the task policy for a specified program via set_mempolicy(2), fork(2) and
> +  exec(2)
> +
> ++ set the shared policy for a shared memory segment via mbind(2)
> +
> +The numactl(8) tool is packages with the run-time version of the library
> +containing the memory policy system call wrappers.  Some distributions
> +package the headers and compile-time libraries in a separate development
> +package.
> +
> +
> +MEMORY POLICIES AND CPUSETS
> +
> +Memory policies work within cpusets as described above.  For memory policies
> +that require a node or set of nodes, the nodes are restricted to the set of
> +nodes whose memories are allowed by the cpuset constraints.  If the
> +intersection of the set of nodes specified for the policy and the set of nodes
> +allowed by the cpuset is the empty set, the policy is considered invalid and
> +cannot be installed.
> +
> +The interaction of memory policies and cpusets can be problematic for a
> +couple of reasons:
> +
> +1) the memory policy APIs take physical node id's as arguments.  However, the
> +   memory policy APIs do not provide a way to determine what nodes are valid
> +   in the context where the application is running.  An application MAY consult
> +   the cpuset file system [directly or via an out of tree, and not generally
> +   available, libcpuset API] to obtain this information, but then the
> +   application must be aware that it is running in a cpuset and use what are
> +   intended primarily as administrative APIs.
> +
> +   However, as long as the policy specifies at least one node that is valid
> +   in the controlling cpuset, the policy can be used.
> +
> +2) when tasks in two cpusets share access to a memory region, such as shared
> +   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
> +   MAP_SHARED flags, and any of the tasks install shared policy on the region,
> +   only nodes whose memories are allowed in both cpusets may be used in the
> +   policies.  Again, obtaining this information requires "stepping outside"
> +   the memory policy APIs, as well as knowing in what cpusets other task might
> +   be attaching to the shared region, to use the cpuset information.
> +   Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
> +   allocation is the only valid policy.
> 

-- 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA policy issues with ZONE_MOVABLE
  2007-08-03  9:32             ` Mel Gorman
@ 2007-08-03 16:36               ` Christoph Lameter
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2007-08-03 16:36 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Lee Schermerhorn, ak, KAMEZAWA Hiroyuki, akpm, pj

(Note there may be performance issues because of the IPI that is now 
necessary on each vmstat access... This means applications polling may 
have to reduce their frequency of access to these variables.)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2007-08-03 16:36 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-25  4:20 NUMA policy issues with ZONE_MOVABLE Christoph Lameter
2007-07-25  4:47 ` Nick Piggin
2007-07-25  5:05   ` Christoph Lameter
2007-07-25  5:24     ` Nick Piggin
2007-07-25  6:00       ` Christoph Lameter
2007-07-25  6:09         ` Nick Piggin
2007-07-25  9:32       ` Andi Kleen
2007-07-25  6:36 ` KAMEZAWA Hiroyuki
2007-07-25 11:16 ` Mel Gorman
2007-07-25 14:30   ` Lee Schermerhorn
2007-07-25 19:31   ` Christoph Lameter
2007-07-26  4:15     ` KAMEZAWA Hiroyuki
2007-07-26  4:53       ` Christoph Lameter
2007-07-26  7:41         ` KAMEZAWA Hiroyuki
2007-07-26 16:16       ` Mel Gorman
2007-07-26 18:03         ` Christoph Lameter
2007-07-26 18:26           ` Mel Gorman
2007-07-26 13:23     ` Mel Gorman
2007-07-26 18:07       ` Christoph Lameter
2007-07-26 22:59         ` Mel Gorman
2007-07-27  1:22           ` Christoph Lameter
2007-07-27  8:20             ` Mel Gorman
2007-07-27 15:45               ` Mel Gorman
2007-07-27 17:35                 ` Christoph Lameter
2007-07-27 17:46                   ` Mel Gorman
2007-07-27 18:38                     ` Christoph Lameter
2007-07-27 18:00                   ` [PATCH] Document Linux Memory Policy - V2 Lee Schermerhorn
2007-07-27 18:38                     ` Randy Dunlap
2007-07-27 19:01                       ` Lee Schermerhorn
2007-07-27 19:21                         ` Randy Dunlap
2007-07-27 18:55                     ` Christoph Lameter
2007-07-27 19:24                       ` Lee Schermerhorn
2007-07-31 15:14                     ` Mel Gorman
2007-07-31 16:34                       ` Lee Schermerhorn
2007-07-31 19:10                         ` Christoph Lameter
2007-07-31 19:46                           ` Lee Schermerhorn
2007-07-31 19:58                             ` Christoph Lameter
2007-07-31 20:23                               ` Lee Schermerhorn
2007-07-31 20:48                         ` [PATCH] Document Linux Memory Policy - V3 Lee Schermerhorn
2007-08-03 13:52                           ` Mel Gorman
2007-07-28  7:28                 ` NUMA policy issues with ZONE_MOVABLE KAMEZAWA Hiroyuki
2007-07-28 11:57                   ` Mel Gorman
2007-07-28 14:10                     ` KAMEZAWA Hiroyuki
2007-07-28 14:21                       ` KAMEZAWA Hiroyuki
2007-07-30 12:41                         ` Mel Gorman
2007-07-30 18:06                           ` Christoph Lameter
2007-07-27 14:24           ` Lee Schermerhorn
2007-08-01 18:59           ` Lee Schermerhorn
2007-08-02  0:36             ` KAMEZAWA Hiroyuki
2007-08-02 17:10             ` Mel Gorman
2007-08-02 17:51               ` Lee Schermerhorn
2007-07-26 18:09       ` Lee Schermerhorn
2007-08-02 14:09     ` Mel Gorman
2007-08-02 18:56       ` Christoph Lameter
2007-08-02 19:42         ` Mel Gorman
2007-08-02 19:52           ` Christoph Lameter
2007-08-03  9:32             ` Mel Gorman
2007-08-03 16:36               ` Christoph Lameter
2007-07-25 14:27 ` Lee Schermerhorn
2007-07-25 17:39   ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.