* [PATCH 0/4] various zone_reclaim cleanup
@ 2009-05-13  3:06 KOSAKI Motohiro
  2009-05-13  3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13  3:06 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter
  Cc: kosaki.motohiro

Here are various zone_reclaim-related cleanups.

[1/4] vmscan: change the number of the unmapped files in zone reclaim
[2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
[3/4] vmscan: zone_reclaim use may_swap
[4/4] zone_reclaim_mode is always 0 by default


Please comment.





* [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
  2009-05-13  3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
@ 2009-05-13  3:06 ` KOSAKI Motohiro
  2009-05-13 13:31   ` Rik van Riel
                     ` (2 more replies)
  2009-05-13  3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13  3:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

Subject: [PATCH] vmscan: change the number of the unmapped files in zone reclaim

Documentation/sysctl/vm.txt says

	A percentage of the total pages in each zone.  Zone reclaim will only
	occur if more than this percentage of pages are file backed and unmapped.
	This is to insure that a minimal amount of local pages is still available for
	file I/O even if the node is overallocated.

However, zone_page_state(zone, NR_FILE_PAGES) also counts some non-file-backed
pages (e.g. swapcache, buffer-head).

The right calculation is to use NR_{IN}ACTIVE_FILE, i.e. NR_INACTIVE_FILE +
NR_ACTIVE_FILE.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
 		.isolate_pages = isolate_pages_global,
 	};
 	unsigned long slab_reclaimable;
+	long nr_unmapped_file_pages;
 
 	disable_swap_token();
 	cond_resched();
@@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-		zone_page_state(zone, NR_FILE_MAPPED) >
-		zone->min_unmapped_pages) {
+	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+				 zone_page_state(zone, NR_ACTIVE_FILE) -
+				 zone_page_state(zone, NR_FILE_MAPPED);
+
+	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
@@ -2458,6 +2461,8 @@ int zone_reclaim(struct zone *zone, gfp_
 {
 	int node_id;
 	int ret;
+	long nr_unmapped_file_pages;
+	long nr_slab_reclaimable;
 
 	/*
 	 * Zone reclaim reclaims unmapped file backed pages and
@@ -2469,10 +2474,12 @@ int zone_reclaim(struct zone *zone, gfp_
 	 * if less than a specified percentage of the zone is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
-	    && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
-			<= zone->min_slab_pages)
+	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+				 zone_page_state(zone, NR_ACTIVE_FILE) -
+				 zone_page_state(zone, NR_FILE_MAPPED);
+	nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+	if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
+	    nr_slab_reclaimable <= zone->min_slab_pages)
 		return 0;
 
 	if (zone_is_all_unreclaimable(zone))
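
For readers following the arithmetic, here is a minimal userspace sketch of
the check this patch introduces. The struct zone_counters type and both helper
names are hypothetical stand-ins for the zone_page_state() lookups; this is
not the kernel code itself. The sketch compares as signed longs, which is an
assumption made here so that a negative difference does not wrap to a large
unsigned value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone counters read via zone_page_state(). */
struct zone_counters {
	unsigned long nr_inactive_file;
	unsigned long nr_active_file;
	unsigned long nr_file_mapped;
	unsigned long min_unmapped_pages;
};

/* Unmapped file pages = file LRU pages (inactive + active) - mapped file pages. */
static long nr_unmapped_file_pages(const struct zone_counters *z)
{
	return (long)(z->nr_inactive_file + z->nr_active_file) -
	       (long)z->nr_file_mapped;
}

/* Reclaim file pages only when the unmapped count exceeds the zone's floor. */
static bool should_reclaim_file_pages(const struct zone_counters *z)
{
	return nr_unmapped_file_pages(z) > (long)z->min_unmapped_pages;
}
```

With 100 inactive, 50 active and 30 mapped file pages, the sketch reports 120
unmapped file pages, which exceeds a floor of 100.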




* [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
  2009-05-13  3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
  2009-05-13  3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
@ 2009-05-13  3:06 ` KOSAKI Motohiro
  2009-05-13 13:35   ` Rik van Riel
                     ` (2 more replies)
  2009-05-13  3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
  2009-05-13  3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
  3 siblings, 3 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13  3:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim

PF_SWAPWRITE means ignore write congestion (see may_write_to_queue()).

Foreground reclaim shouldn't ignore it, because writing to a congested device
causes large I/O latency.
It is no better than a remote node allocation.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2406,7 +2406,7 @@ static int __zone_reclaim(struct zone *z
 	 * and we also need to be able to write out pages for RECLAIM_WRITE
 	 * and RECLAIM_SWAP.
 	 */
-	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	p->flags |= PF_MEMALLOC;
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
@@ -2453,7 +2453,7 @@ static int __zone_reclaim(struct zone *z
 	}
 
 	p->reclaim_state = NULL;
-	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+	current->flags &= ~PF_MEMALLOC;
 	return sc.nr_reclaimed >= nr_pages;
 }
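
As a rough illustration of what the description above means by "ignore write
congestion", here is a simplified userspace model. The flag values, the
fake_bdi and fake_task types, and the function name are all assumptions made
for this sketch. It models only the two conditions discussed in this patch,
not the full kernel check.

```c
#include <stdbool.h>

/* Illustrative flag bits; the exact values are assumptions for this sketch. */
#define PF_MEMALLOC	0x00000800u
#define PF_SWAPWRITE	0x00800000u

/* Hypothetical stand-ins for a backing device and the reclaiming task. */
struct fake_bdi  { bool write_congested; };
struct fake_task { unsigned int flags; };

/* Simplified model: may this task queue a write to the given device? */
static bool may_write_to_queue_sketch(const struct fake_task *t,
				      const struct fake_bdi *bdi)
{
	if (t->flags & PF_SWAPWRITE)
		return true;	/* congestion is ignored for this task */
	return !bdi->write_congested;
}
```

In this model, a task with only PF_MEMALLOC set backs off from a congested
device, while one that also carries PF_SWAPWRITE writes anyway.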
 




* [PATCH 3/4] vmscan: zone_reclaim use may_swap
  2009-05-13  3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
  2009-05-13  3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
  2009-05-13  3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
@ 2009-05-13  3:07 ` KOSAKI Motohiro
  2009-05-13 11:26   ` Johannes Weiner
                     ` (3 more replies)
  2009-05-13  3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
  3 siblings, 4 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13  3:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

Subject: [PATCH] vmscan: zone_reclaim use may_swap

Documentation/sysctl/vm.txt says

	zone_reclaim_mode:

	Zone_reclaim_mode allows someone to set more or less aggressive approaches to
	reclaim memory when a zone runs out of memory. If it is set to zero then no
	zone reclaim occurs. Allocations will be satisfied from other zones / nodes
	in the system.

	This is value ORed together of

	1	= Zone reclaim on
	2	= Zone reclaim writes dirty pages out
	4	= Zone reclaim swaps pages


So, "(zone_reclaim_mode & RECLAIM_SWAP) == 0" means we don't want to reclaim
swap-backed pages, not mapped file pages.

Thus, may_swap is a better fit than may_unmap.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2387,8 +2387,8 @@ static int __zone_reclaim(struct zone *z
 	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
-		.may_swap = 1,
+		.may_unmap = 1,
+		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.swap_cluster_max = max_t(unsigned long, nr_pages,
 					SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
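
To make the bit mapping concrete, here is a small userspace sketch of how the
mode bits would translate into scan flags after this patch. The bit values
follow the vm.txt text quoted above. The struct scan_flags type and the
flags_from_mode() helper are hypothetical names for illustration only.

```c
#include <stdbool.h>

/* Bit values as listed in the quoted Documentation/sysctl/vm.txt text. */
#define RECLAIM_ZONE	1	/* zone reclaim on */
#define RECLAIM_WRITE	2	/* zone reclaim writes dirty pages out */
#define RECLAIM_SWAP	4	/* zone reclaim swaps pages */

/* Hypothetical subset of scan_control holding only the three flags. */
struct scan_flags {
	bool may_writepage;
	bool may_unmap;
	bool may_swap;
};

/* Post-patch mapping: RECLAIM_SWAP gates swapping, unmapping is always allowed. */
static struct scan_flags flags_from_mode(int mode)
{
	struct scan_flags f = {
		.may_writepage = !!(mode & RECLAIM_WRITE),
		.may_unmap = true,
		.may_swap = !!(mode & RECLAIM_SWAP),
	};
	return f;
}
```

With mode 1 (zone reclaim on, nothing else), the sketch yields may_unmap set
and both may_writepage and may_swap clear.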




* [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13  3:06 [PATCH 0/4] various zone_reclaim cleanup KOSAKI Motohiro
                   ` (2 preceding siblings ...)
  2009-05-13  3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
@ 2009-05-13  3:08 ` KOSAKI Motohiro
  2009-05-13 14:47   ` Rik van Riel
                     ` (3 more replies)
  3 siblings, 4 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-13  3:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

Subject: [PATCH] zone_reclaim_mode is always 0 by default

The current Linux policy is that if the machine has a large remote node distance,
zone_reclaim_mode is enabled by default, because until recently we could assume
that a large distance meant a large server.

Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have memory controllers with
P2P transport. IOW, such a machine is NUMA from the software's point of view.

Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
suit desktops and small file servers. It causes a performance regression.

Thus, zone_reclaim_mode == 0 is the better default. Sorry, HPC guys:
you now need to turn zone_reclaim_mode on manually.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |    7 -------
 1 file changed, 7 deletions(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
 		int distance = node_distance(local_node, node);
 
 		/*
-		 * If another node is sufficiently far away then it is better
-		 * to reclaim pages in a zone before going off node.
-		 */
-		if (distance > RECLAIM_DISTANCE)
-			zone_reclaim_mode = 1;
-
-		/*
 		 * We don't want to pressure a particular node.
 		 * So adding penalty to the first node in same
 		 * distance group to make it round-robin.




* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
  2009-05-13  3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
@ 2009-05-13 11:26   ` Johannes Weiner
  2009-05-13 14:43   ` Rik van Riel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Johannes Weiner @ 2009-05-13 11:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:07:30PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: zone_reclaim use may_swap
> 
> Documentation/sysctl/vm.txt says
> 
> 	zone_reclaim_mode:
> 
> 	Zone_reclaim_mode allows someone to set more or less aggressive approaches to
> 	reclaim memory when a zone runs out of memory. If it is set to zero then no
> 	zone reclaim occurs. Allocations will be satisfied from other zones / nodes
> 	in the system.
> 
> 	This is value ORed together of
> 
> 	1	= Zone reclaim on
> 	2	= Zone reclaim writes dirty pages out
> 	4	= Zone reclaim swaps pages
> 
> 
> So, "(zone_reclaim_mode & RECLAIM_SWAP) == 0" means we don't want to reclaim
> swap-backed pages, not mapped file pages.
> 
> Thus, may_swap is a better fit than may_unmap.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
  2009-05-13  3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
@ 2009-05-13 13:31   ` Rik van Riel
  2009-05-14 19:52   ` Christoph Lameter
  2009-05-18  3:15   ` Wu Fengguang
  2 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 13:31 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: change the number of the unmapped files in zone reclaim

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
  2009-05-13  3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
@ 2009-05-13 13:35   ` Rik van Riel
  2009-05-14 19:57   ` Christoph Lameter
  2009-05-18  3:33   ` Wu Fengguang
  2 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 13:35 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim
> 
> PF_SWAPWRITE means ignore write congestion (see may_write_to_queue()).
> 
> Foreground reclaim shouldn't ignore it, because writing to a congested device
> causes large I/O latency.
> It is no better than a remote node allocation.

It might be on NUMAQ (which is no longer manufactured), but
your change looks right for every other vaguely modern NUMA
architecture that I know of.

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
  2009-05-13  3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
  2009-05-13 11:26   ` Johannes Weiner
@ 2009-05-13 14:43   ` Rik van Riel
  2009-05-14 19:59   ` Christoph Lameter
  2009-05-18  3:35   ` Wu Fengguang
  3 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 14:43 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: zone_reclaim use may_swap

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13  3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
@ 2009-05-13 14:47   ` Rik van Riel
  2009-05-14  8:20     ` KOSAKI Motohiro
  2009-05-13 15:22   ` Robin Holt
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2009-05-13 14:47 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter

KOSAKI Motohiro wrote:
> Subject: [PATCH] zone_reclaim_mode is always 0 by default
> 
> The current Linux policy is that if the machine has a large remote node distance,
> zone_reclaim_mode is enabled by default, because until recently we could assume
> that a large distance meant a large server.
> 
> Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have memory controllers with
> P2P transport. IOW, such a machine is NUMA from the software's point of view.
> 
> Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
> suit desktops and small file servers. It causes a performance regression.
> 
> Thus, zone_reclaim_mode == 0 is the better default. Sorry, HPC guys:
> you now need to turn zone_reclaim_mode on manually.

I'll believe that it causes a performance regression with the
old zone_reclaim behaviour, however the way you tweaked
zone_reclaim should make it behave a lot better, no?

-- 
All rights reversed.


* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13  3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
  2009-05-13 14:47   ` Rik van Riel
@ 2009-05-13 15:22   ` Robin Holt
  2009-05-14 20:05     ` Christoph Lameter
  2009-05-18  3:49   ` Wu Fengguang
  2009-05-18  9:09   ` Wu Fengguang
  3 siblings, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-13 15:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] zone_reclaim_mode is always 0 by default
> 
> The current Linux policy is that if the machine has a large remote node distance,
> zone_reclaim_mode is enabled by default, because until recently we could assume
> that a large distance meant a large server.
> 
> Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have memory controllers with
> P2P transport. IOW, such a machine is NUMA from the software's point of view.
> 
> Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
> suit desktops and small file servers. It causes a performance regression.
> 
> Thus, zone_reclaim_mode == 0 is the better default. Sorry, HPC guys:
> you now need to turn zone_reclaim_mode on manually.

I am _VERY_ concerned about this change in behavior as it has been the
default for a considerable period of time.  I realize it is an easily
changed setting, but it is churn in the default behavior.  Are there
any benefits for these small servers to have zone_reclaim turned on?
If you have a large node distance, I would expect they should benefit
_MORE_ than those with small or no node distances.

Are you seeing an impact of the load not distributing pages evenly across
processors instead of a reclaim effect (ie, a single threaded process
faulting in more memory than is node local and expecting those pages
to come from the other node first before doing reclaim)?  Maybe there
is a different issue than the ones I am used to thinking about and I am
completely missing the point, please enlighten me.

If this proceeds forward, I would like to propose we at least leave
it on for SGI SN and UV hardware.  I can provide a quick patch that
may be a bit ugly because it will depend upon arch specific #defines.
I have not investigated this, but any alternative suggestions are
certainly welcome.  Currently, I am envisioning bringing something like
ia64_platform_is("sn2") and is_uv_system into page_alloc.c.

> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>

Please add me:

Cc: Robin Holt <holt@sgi.com>


* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13 14:47   ` Rik van Riel
@ 2009-05-14  8:20     ` KOSAKI Motohiro
  2009-05-14 11:48       ` Robin Holt
  0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-14  8:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton,
	Christoph Lameter, Robin Holt

(cc to Robin)

> KOSAKI Motohiro wrote:
> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
> > 
> > The current Linux policy is that if the machine has a large remote node distance,
> > zone_reclaim_mode is enabled by default, because until recently we could assume
> > that a large distance meant a large server.
> > 
> > Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have memory controllers with
> > P2P transport. IOW, such a machine is NUMA from the software's point of view.
> > 
> > Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
> > suit desktops and small file servers. It causes a performance regression.
> > 
> > Thus, zone_reclaim_mode == 0 is the better default. Sorry, HPC guys:
> > you now need to turn zone_reclaim_mode on manually.
> 
> I'll believe that it causes a performance regression with the
> old zone_reclaim behaviour, however the way you tweaked
> zone_reclaim should make it behave a lot better, no?

Unfortunately, no.
Zone reclaim has two weaknesses by design.

1.
Zone reclaim doesn't work well when the working set size > local node size,
and that can easily happen on a small machine.
When it happens, zone reclaim drops the process's own memory.

Plus, zone reclaim also doesn't suit DB servers; their processes have large
working sets.


2.
Zone reclaim has an inter-zone balancing issue.

Example: an x86_64 2-node 8GB machine has the following zone assignment

   zone 0 (DMA32):  3GB
   zone 0 (Normal): 1GB
   zone 1 (Normal): 4GB

If the page is allocated from DMA32, you are lucky: DMA32 isn't reclaimed
so frequently. But if it comes from zone 0 Normal, you are unlucky:
it is reclaimed very frequently although it is smaller than the other zones.


I know my patch changes the large-server default, but I believe the Linux
default kernel parameters should suit desktops and entry-level machines.





* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-14  8:20     ` KOSAKI Motohiro
@ 2009-05-14 11:48       ` Robin Holt
  2009-05-14 12:02         ` KOSAKI Motohiro
  0 siblings, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-14 11:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, LKML, linux-mm, Andrew Morton, Christoph Lameter,
	Robin Holt

> Unfortunately, no.
> Zone reclaim has two weaknesses by design.
> 
> 1.
> Zone reclaim doesn't work well when the working set size > local node size,
> and that can easily happen on a small machine.
> When it happens, zone reclaim drops the process's own memory.
> 
> Plus, zone reclaim also doesn't suit DB servers; their processes have large
> working sets.

Large DB server is not your typical desktop application either.

> 2.
> Zone reclaim has an inter-zone balancing issue.
> 
> Example: an x86_64 2-node 8GB machine has the following zone assignment
> 
>    zone 0 (DMA32):  3GB
>    zone 0 (Normal): 1GB
>    zone 1 (Normal): 4GB
> 
> If the page is allocated from DMA32, you are lucky: DMA32 isn't reclaimed
> so frequently. But if it comes from zone 0 Normal, you are unlucky:
> it is reclaimed very frequently although it is smaller than the other zones.

I have seen that behavior on some of our mismatched large systems as well,
although never had one so imbalanced because ia64 only has Normal.

> I know my patch changes the large-server default, but I believe the Linux
> default kernel parameters should suit desktops and entry-level machines.

If this imbalance is an x86_64 only problem, then we could do something
simple like the following untested patch.  This leaves the default
for everyone except x86_64.

Robin

------------------------------------------------------------------------

Even if there is a great node distance on x86_64, disable zone reclaim
by default.  This was done to handle the imbalanced zone sizes where a
majority of the memory in zone 0 is DMA32 with a small remaining Normal
which will be aggressively reclaimed.

For other architectures, we leave the default behavior.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>

---
 arch/x86/include/asm/topology.h |    2 ++
 include/linux/topology.h        |    5 +++++
 mm/page_alloc.c                 |    2 +-
 3 files changed, 8 insertions(+), 1 deletion(-)
Index: page_reclaim_mode/arch/x86/include/asm/topology.h
===================================================================
--- page_reclaim_mode.orig/arch/x86/include/asm/topology.h	2009-05-14 06:44:20.118925713 -0500
+++ page_reclaim_mode/arch/x86/include/asm/topology.h	2009-05-14 06:44:21.251067716 -0500
@@ -128,6 +128,8 @@ extern unsigned long node_remap_size[];
 
 #endif
 
+#define DEFAULT_ZONE_RECLAIM_MODE	0
+
 /* sched_domains SD_NODE_INIT for NUMA machines */
 #define SD_NODE_INIT (struct sched_domain) {		\
 	.min_interval		= 8,			\
Index: page_reclaim_mode/include/linux/topology.h
===================================================================
--- page_reclaim_mode.orig/include/linux/topology.h	2009-05-14 06:44:20.070919619 -0500
+++ page_reclaim_mode/include/linux/topology.h	2009-05-14 06:44:21.279071382 -0500
@@ -61,6 +61,11 @@ int arch_update_cpu_topology(void);
  */
 #define RECLAIM_DISTANCE 20
 #endif
+
+#ifndef DEFAULT_ZONE_RECLAIM_MODE
+#define DEFAULT_ZONE_RECLAIM_MODE	1
+#endif
+
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
Index: page_reclaim_mode/mm/page_alloc.c
===================================================================
--- page_reclaim_mode.orig/mm/page_alloc.c	2009-05-14 06:44:20.138928363 -0500
+++ page_reclaim_mode/mm/page_alloc.c	2009-05-14 06:44:21.311075244 -0500
@@ -2331,7 +2331,7 @@ static void build_zonelists(pg_data_t *p
 		 * to reclaim pages in a zone before going off node.
 		 */
 		if (distance > RECLAIM_DISTANCE)
-			zone_reclaim_mode = 1;
+			zone_reclaim_mode = DEFAULT_ZONE_RECLAIM_MODE;
 
 		/*
 		 * We don't want to pressure a particular node.
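
The override pattern in the patch above can be tried in isolation. The
following is a minimal userspace sketch, assuming no architecture header has
defined DEFAULT_ZONE_RECLAIM_MODE, so the generic fallback of 1 applies. The
helper name pick_zone_reclaim_mode() is hypothetical and only mirrors the
assignment in build_zonelists().

```c
/*
 * An architecture header would be included first and could define
 * DEFAULT_ZONE_RECLAIM_MODE (the patch above uses 0 for x86).
 * Nothing defines it in this sketch, so the fallback below is used.
 */
#ifndef RECLAIM_DISTANCE
#define RECLAIM_DISTANCE 20
#endif

#ifndef DEFAULT_ZONE_RECLAIM_MODE
#define DEFAULT_ZONE_RECLAIM_MODE	1
#endif

/* Return the mode to use after seeing a node at the given distance. */
static int pick_zone_reclaim_mode(int current_mode, int distance)
{
	if (distance > RECLAIM_DISTANCE)
		return DEFAULT_ZONE_RECLAIM_MODE;
	return current_mode;
}
```

Building the same file with -DDEFAULT_ZONE_RECLAIM_MODE=0 would model the x86
case, where a distant node no longer switches zone reclaim on.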


* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-14 11:48       ` Robin Holt
@ 2009-05-14 12:02         ` KOSAKI Motohiro
  0 siblings, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-14 12:02 UTC (permalink / raw)
  To: Robin Holt
  Cc: kosaki.motohiro, Rik van Riel, LKML, linux-mm, Andrew Morton,
	Christoph Lameter

> > Unfortunately, no.
> > Zone reclaim has two weaknesses by design.
> > 
> > 1.
> > Zone reclaim doesn't work well when the working set size > local node size,
> > and that can easily happen on a small machine.
> > When it happens, zone reclaim drops the process's own memory.
> > 
> > Plus, zone reclaim also doesn't suit DB servers; their processes have large
> > working sets.
> 
> Large DB server is not your typical desktop application either.

ack.


> > 2.
> > Zone reclaim has an inter-zone balancing issue.
> > 
> > Example: an x86_64 2-node 8GB machine has the following zone assignment
> > 
> >    zone 0 (DMA32):  3GB
> >    zone 0 (Normal): 1GB
> >    zone 1 (Normal): 4GB
> > 
> > If the page is allocated from DMA32, you are lucky: DMA32 isn't reclaimed
> > so frequently. But if it comes from zone 0 Normal, you are unlucky:
> > it is reclaimed very frequently although it is smaller than the other zones.
> 
> I have seen that behavior on some of our mismatched large systems as well,
> although never had one so imbalanced because ia64 only has Normal.

Not true.
Some ia64 servers have a DMA zone of about 2GB. SGI ia64 is a special case.


> > I know my patch changes the large-server default, but I believe the Linux
> > default kernel parameters should suit desktops and entry-level machines.
> 
> If this imbalance is an x86_64 only problem, then we could do something
> simple like the following untested patch.  This leaves the default
> for everyone except x86_64.

Not x86_64 only.
Many 64-bit architectures have a 2 or 4GB DMA zone.

Even so, your patch seems interesting. At least it solves the
desktop users' issue, and we don't need to worry about users in other areas.

Embedded and high-end server users are typically skillful; they can
change kernel parameters by themselves.


> 
> Robin
> 
> ------------------------------------------------------------------------
> 
> Even if there is a great node distance on x86_64, disable zone reclaim
> by default.  This was done to handle the imbalanced zone sizes where a
> majority of the memory in zone 0 is DMA32 with a small remaining Normal
> which will be aggressively reclaimed.
> 
> For other architectures, we leave the default behavior.
> 
> Signed-off-by: Robin Holt <holt@sgi.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> 
> ---
>  arch/x86/include/asm/topology.h |    2 ++
>  include/linux/topology.h        |    5 +++++
>  mm/page_alloc.c                 |    2 +-
>  3 files changed, 8 insertions(+), 1 deletion(-)
> Index: page_reclaim_mode/arch/x86/include/asm/topology.h
> ===================================================================
> --- page_reclaim_mode.orig/arch/x86/include/asm/topology.h	2009-05-14 06:44:20.118925713 -0500
> +++ page_reclaim_mode/arch/x86/include/asm/topology.h	2009-05-14 06:44:21.251067716 -0500
> @@ -128,6 +128,8 @@ extern unsigned long node_remap_size[];
>  
>  #endif
>  
> +#define DEFAULT_ZONE_RECLAIM_MODE	0
> +
>  /* sched_domains SD_NODE_INIT for NUMA machines */
>  #define SD_NODE_INIT (struct sched_domain) {		\
>  	.min_interval		= 8,			\
> Index: page_reclaim_mode/include/linux/topology.h
> ===================================================================
> --- page_reclaim_mode.orig/include/linux/topology.h	2009-05-14 06:44:20.070919619 -0500
> +++ page_reclaim_mode/include/linux/topology.h	2009-05-14 06:44:21.279071382 -0500
> @@ -61,6 +61,11 @@ int arch_update_cpu_topology(void);
>   */
>  #define RECLAIM_DISTANCE 20
>  #endif
> +
> +#ifndef DEFAULT_ZONE_RECLAIM_MODE
> +#define DEFAULT_ZONE_RECLAIM_MODE	1
> +#endif
> +
>  #ifndef PENALTY_FOR_NODE_WITH_CPUS
>  #define PENALTY_FOR_NODE_WITH_CPUS	(1)
>  #endif
> Index: page_reclaim_mode/mm/page_alloc.c
> ===================================================================
> --- page_reclaim_mode.orig/mm/page_alloc.c	2009-05-14 06:44:20.138928363 -0500
> +++ page_reclaim_mode/mm/page_alloc.c	2009-05-14 06:44:21.311075244 -0500
> @@ -2331,7 +2331,7 @@ static void build_zonelists(pg_data_t *p
>  		 * to reclaim pages in a zone before going off node.
>  		 */
>  		if (distance > RECLAIM_DISTANCE)
> -			zone_reclaim_mode = 1;
> +			zone_reclaim_mode = DEFAULT_ZONE_RECLAIM_MODE;
>  
>  		/*
>  		 * We don't want to pressure a particular node.





* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
  2009-05-13  3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
  2009-05-13 13:31   ` Rik van Riel
@ 2009-05-14 19:52   ` Christoph Lameter
  2009-05-18  3:15   ` Wu Fengguang
  2 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:52 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel



Yup, the use of NR_FILE_PAGES there predates the INACTIVE/ACTIVE stats.

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>



* Re: [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
  2009-05-13  3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
  2009-05-13 13:35   ` Rik van Riel
@ 2009-05-14 19:57   ` Christoph Lameter
  2009-05-18  3:33   ` Wu Fengguang
  2 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:57 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel

On Wed, 13 May 2009, KOSAKI Motohiro wrote:

> Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim
>
> PF_SWAPWRITE means ignore write congestion (see may_write_to_queue()).
>
> Foreground reclaim shouldn't ignore it, because writing to a congested device
> causes large I/O latency.
> It is no better than a remote node allocation.

Zone reclaim by default does not perform writes. RECLAIM_WRITE must be set
for that to be effective.

Acked-by: Christoph Lameter <cl@linux-foundation.org>




* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
  2009-05-13  3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
  2009-05-13 11:26   ` Johannes Weiner
  2009-05-13 14:43   ` Rik van Riel
@ 2009-05-14 19:59   ` Christoph Lameter
  2009-05-18  3:35   ` Wu Fengguang
  3 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:59 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel


Acked-by: Christoph Lameter <cl@linux-foundation.org>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13 15:22   ` Robin Holt
@ 2009-05-14 20:05     ` Christoph Lameter
  2009-05-14 20:23       ` Rik van Riel
  2009-05-15  1:02       ` KOSAKI Motohiro
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 20:05 UTC (permalink / raw)
  To: Robin Holt; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel


Not having zone reclaim on a NUMA system often means that per node
allocations will fall back. Optimized node local allocations become very
difficult for the page allocator. If the latency penalties are not
significant then this may not matter. The larger the system, the larger
the NUMA latencies become.

One possibility would be to disable zone reclaim for low node numbers.
Enable it only if more than 4 nodes exist?




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-14 20:05     ` Christoph Lameter
@ 2009-05-14 20:23       ` Rik van Riel
  2009-05-14 20:31         ` Christoph Lameter
  2009-05-15  1:02       ` KOSAKI Motohiro
  1 sibling, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2009-05-14 20:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, KOSAKI Motohiro, LKML, linux-mm, Andrew Morton

Christoph Lameter wrote:
> Not having zone reclaim on a NUMA system often means that per node
> allocations will fall back. Optimized node local allocations become very
> difficult for the page allocator. If the latency penalties are not
> significant then this may not matter. The larger the system, the larger
> the NUMA latencies become.
> 
> One possibility would be to disable zone reclaim for low node numbers.
> Enable it only if more than 4 nodes exist?

I suspect that patches 1/4 through 3/4 will cause the
system to behave better already, by only reclaiming
the easiest to reclaim pages from zone reclaim and
falling back after that - or am I overlooking something?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-14 20:23       ` Rik van Riel
@ 2009-05-14 20:31         ` Christoph Lameter
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-14 20:31 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Robin Holt, KOSAKI Motohiro, LKML, linux-mm, Andrew Morton

On Thu, 14 May 2009, Rik van Riel wrote:

> I suspect that patches 1/4 through 3/4 will cause the
> system to behave better already, by only reclaiming
> the easiest to reclaim pages from zone reclaim and
> falling back after that - or am I overlooking something?

Zone reclaim's default config has always reclaimed only the easiest
reclaimable pages. Manual configuration is necessary to reclaim other
pages.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-14 20:05     ` Christoph Lameter
  2009-05-14 20:23       ` Rik van Riel
@ 2009-05-15  1:02       ` KOSAKI Motohiro
  2009-05-15 10:51         ` Robin Holt
  2009-05-15 18:01         ` Christoph Lameter
  1 sibling, 2 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-15  1:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

> Not having zone reclaim on a NUMA system often means that per node
> allocations will fall back. Optimized node local allocations become very
> difficult for the page allocator. If the latency penalties are not
> significant then this may not matter. The larger the system, the larger
> the NUMA latencies become.
> 
> One possibility would be to disable zone reclaim for low node numbers.
> Enable it only if more than 4 nodes exist?

I think this idea works well on every machine and doesn't cause confusion
for HPC users.

How about this?

==============================
Subject: [PATCH] zone_reclaim is always 0 by default on small machine

Current Linux policy is that zone_reclaim_mode is enabled by default if the machine
has a large remote node distance, because until recently we could assume that a
large distance meant a large server.

Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
memory controller. IOW, the machine is seen as NUMA from the software's view.

Some Core i7 machines have a large remote node distance, but zone_reclaim doesn't
fit desktops and small file servers. It causes a performance regression.

Thus, zone_reclaim == 0 is a better default if the machine is small.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <holt@sgi.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2497,7 +2497,7 @@ static void build_zonelists(pg_data_t *p
 		 * If another node is sufficiently far away then it is better
 		 * to reclaim pages in a zone before going off node.
 		 */
-		if (distance > RECLAIM_DISTANCE)
+		if (nr_online_nodes >= 4 && distance > RECLAIM_DISTANCE)
 			zone_reclaim_mode = 1;
 
 		/*
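
The condition in the hunk above can be tried out in userspace. The following is a minimal sketch, not kernel code: sketch_zone_reclaim_default() is a hypothetical helper, and SKETCH_RECLAIM_DISTANCE is an assumed value chosen for illustration (the real RECLAIM_DISTANCE comes from the architecture's topology headers).

```c
#include <stdbool.h>

/* Assumed threshold, for illustration only; the kernel takes the real
 * RECLAIM_DISTANCE from its topology headers. */
#define SKETCH_RECLAIM_DISTANCE 20

/*
 * Hypothetical helper mirroring the proposed condition: enable zone
 * reclaim by default only when the machine has at least 4 online nodes
 * AND some node is farther away than the reclaim distance.
 */
static bool sketch_zone_reclaim_default(int nr_online_nodes, int distance)
{
	return nr_online_nodes >= 4 && distance > SKETCH_RECLAIM_DISTANCE;
}
```

Under these assumptions a 2-node machine keeps zone reclaim off no matter how large the reported distance is, while a 4-node machine with a distance above the threshold still gets it by default.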



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-15  1:02       ` KOSAKI Motohiro
@ 2009-05-15 10:51         ` Robin Holt
  2009-05-19  2:53           ` KOSAKI Motohiro
  2009-05-15 18:01         ` Christoph Lameter
  1 sibling, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-15 10:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Robin Holt, LKML, linux-mm, Andrew Morton,
	Rik van Riel

> Current Linux policy is that zone_reclaim_mode is enabled by default if the machine
> has a large remote node distance, because until recently we could assume that a
> large distance meant a large server.
> 
> Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
> memory controller. IOW, the machine is seen as NUMA from the software's view.
> 
> Some Core i7 machines have a large remote node distance, but zone_reclaim doesn't
> fit desktops and small file servers. It causes a performance regression.
> 
> Thus, zone_reclaim == 0 is a better default if the machine is small.

What if I had a node 0 with 32GB or 128GB of memory?  In that case,
we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
128GB.  I would suggest that zone reclaim would perform normally and
be beneficial.

You are unfairly classifying this as a size of machine problem when it is
really a problem with the underlying zone reclaim code being triggered
due to imbalanced node/zones, part of which is due to a single node
having multiple zones and those multiple zones setting up the conditions
for extremely aggressive reclaim.  In other words, you are putting a
bandage in place to hide a problem on your particular hardware.

Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
Aren't 4 node Ci7 boxes soon to be readily available?  How are your apps
different from my apps in that you are not impacted by node locality?
Are you being too insensitive to node locality?  Conversely am I being
too sensitive?

All that said, I would not stop this from going in.  I just think the
selection criteria is rather random.  I think we know the condition we
are trying to avoid which is a small Normal zone on one node and a larger
Normal zone on another causing zone reclaim to be overly aggressive.
I don't know how to quantify "small" versus "large".  I would suggest
that a node 0 with 16 or more GB should have zone reclaim on by default
as well.  Can that be expressed in the selection criteria?

Thanks,
Robin Holt

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-15  1:02       ` KOSAKI Motohiro
  2009-05-15 10:51         ` Robin Holt
@ 2009-05-15 18:01         ` Christoph Lameter
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-15 18:01 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

On Fri, 15 May 2009, KOSAKI Motohiro wrote:

> How about this?

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
  2009-05-13  3:06 ` [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim KOSAKI Motohiro
  2009-05-13 13:31   ` Rik van Riel
  2009-05-14 19:52   ` Christoph Lameter
@ 2009-05-18  3:15   ` Wu Fengguang
  2009-05-18  3:35     ` KOSAKI Motohiro
  2 siblings, 1 reply; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18  3:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:06:28PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: change the number of the unmapped files in zone reclaim
> 
> Documentation/sysctl/vm.txt says
> 
> 	A percentage of the total pages in each zone.  Zone reclaim will only
> 	occur if more than this percentage of pages are file backed and unmapped.
> 	This is to insure that a minimal amount of local pages is still available for
> 	file I/O even if the node is overallocated.
> 
> However, zone_page_state(zone, NR_FILE_PAGES) contains some non-file-backed pages
> (e.g. swapcache, buffer-head).
> 
> The right calculation is to use NR_{IN}ACTIVE_FILE.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
>  		.isolate_pages = isolate_pages_global,
>  	};
>  	unsigned long slab_reclaimable;
> +	long nr_unmapped_file_pages;
>  
>  	disable_swap_token();
>  	cond_resched();
> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
>  	reclaim_state.reclaimed_slab = 0;
>  	p->reclaim_state = &reclaim_state;
>  
> -	if (zone_page_state(zone, NR_FILE_PAGES) -
> -		zone_page_state(zone, NR_FILE_MAPPED) >
> -		zone->min_unmapped_pages) {
> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
> +				 zone_page_state(zone, NR_FILE_MAPPED);

This can possibly go negative.

> +	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
>  		/*
>  		 * Free memory by calling shrink zone with increasing
>  		 * priorities until we have enough memory freed.
> @@ -2458,6 +2461,8 @@ int zone_reclaim(struct zone *zone, gfp_
>  {
>  	int node_id;
>  	int ret;
> +	long nr_unmapped_file_pages;
> +	long nr_slab_reclaimable;
>  
>  	/*
>  	 * Zone reclaim reclaims unmapped file backed pages and
> @@ -2469,10 +2474,12 @@ int zone_reclaim(struct zone *zone, gfp_
>  	 * if less than a specified percentage of the zone is used by
>  	 * unmapped file backed pages.
>  	 */
> -	if (zone_page_state(zone, NR_FILE_PAGES) -
> -	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> -	    && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> -			<= zone->min_slab_pages)
> +	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				 zone_page_state(zone, NR_ACTIVE_FILE) -
> +				 zone_page_state(zone, NR_FILE_MAPPED);

Ditto.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

> +	nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> +	if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
> +	    nr_slab_reclaimable <= zone->min_slab_pages)
>  		return 0;
>  
>  	if (zone_is_all_unreclaimable(zone))
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim
  2009-05-13  3:06 ` [PATCH 2/4] vmscan: drop PF_SWAPWRITE from zone_reclaim KOSAKI Motohiro
  2009-05-13 13:35   ` Rik van Riel
  2009-05-14 19:57   ` Christoph Lameter
@ 2009-05-18  3:33   ` Wu Fengguang
  2 siblings, 0 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18  3:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:06:51PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: drop PF_SWAPWRITE from zone_reclaim
> 
> PF_SWAPWRITE means ignore write congestion (see may_write_to_queue()).
> 
> Foreground reclaim shouldn't ignore it, because writing to a congested device
> causes large IO latency.
> That is no better than a remote node allocation.

Acked-by: Wu Fengguang <fengguang.wu@intel.com> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] vmscan: zone_reclaim use may_swap
  2009-05-13  3:07 ` [PATCH 3/4] vmscan: zone_reclaim use may_swap KOSAKI Motohiro
                     ` (2 preceding siblings ...)
  2009-05-14 19:59   ` Christoph Lameter
@ 2009-05-18  3:35   ` Wu Fengguang
  3 siblings, 0 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18  3:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:07:30PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan: zone_reclaim use may_swap
> 
> Documentation/sysctl/vm.txt says
> 
> 	zone_reclaim_mode:
> 
> 	Zone_reclaim_mode allows someone to set more or less aggressive approaches to
> 	reclaim memory when a zone runs out of memory. If it is set to zero then no
> 	zone reclaim occurs. Allocations will be satisfied from other zones / nodes
> 	in the system.
> 
> 	This is value ORed together of
> 
> 	1	= Zone reclaim on
> 	2	= Zone reclaim writes dirty pages out
> 	4	= Zone reclaim swaps pages
> 
> 
> So "(zone_reclaim_mode & RECLAIM_SWAP) == 0" means we don't want to reclaim
> swap-backed pages, not that mapped file pages must be left alone.
> 
> Thus, may_swap is a better fit than may_unmap.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2387,8 +2387,8 @@ static int __zone_reclaim(struct zone *z
>  	int priority;
>  	struct scan_control sc = {
>  		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> -		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> -		.may_swap = 1,
> +		.may_unmap = 1,
> +		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.swap_cluster_max = max_t(unsigned long, nr_pages,
>  					SWAP_CLUSTER_MAX),
>  		.gfp_mask = gfp_mask,
> 

Acked-by: Wu Fengguang <fengguang.wu@intel.com> 
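
The effect of the hunk above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct sketch_scan_control is a hypothetical, reduced stand-in for the kernel's struct scan_control, and the SKETCH_RECLAIM_* values are taken from the Documentation/sysctl/vm.txt text quoted in the patch description (1, 2, 4).

```c
#include <stdbool.h>

/* Bit values as listed in the quoted vm.txt text. */
#define SKETCH_RECLAIM_ZONE  1	/* zone reclaim on */
#define SKETCH_RECLAIM_WRITE 2	/* zone reclaim writes dirty pages out */
#define SKETCH_RECLAIM_SWAP  4	/* zone reclaim swaps pages */

/* Hypothetical, reduced stand-in for struct scan_control. */
struct sketch_scan_control {
	bool may_writepage;
	bool may_unmap;
	bool may_swap;
};

/* Before the patch: the RECLAIM_SWAP bit gated unmapping of any page. */
static struct sketch_scan_control sketch_sc_before(int mode)
{
	struct sketch_scan_control sc = {
		.may_writepage = !!(mode & SKETCH_RECLAIM_WRITE),
		.may_unmap = !!(mode & SKETCH_RECLAIM_SWAP),
		.may_swap = true,
	};
	return sc;
}

/* After the patch: the RECLAIM_SWAP bit gates swapping only. */
static struct sketch_scan_control sketch_sc_after(int mode)
{
	struct sketch_scan_control sc = {
		.may_writepage = !!(mode & SKETCH_RECLAIM_WRITE),
		.may_unmap = true,
		.may_swap = !!(mode & SKETCH_RECLAIM_SWAP),
	};
	return sc;
}
```

With mode 1 (zone reclaim on, nothing else), the "before" variant forbids unmapping but allows swap, while the "after" variant allows unmapping and forbids swap, which matches the wording of the sysctl documentation.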

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in  zone reclaim
  2009-05-18  3:15   ` Wu Fengguang
@ 2009-05-18  3:35     ` KOSAKI Motohiro
  2009-05-18  3:53       ` Wu Fengguang
  0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-18  3:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
>>               .isolate_pages = isolate_pages_global,
>>       };
>>       unsigned long slab_reclaimable;
>> +     long nr_unmapped_file_pages;
>>
>>       disable_swap_token();
>>       cond_resched();
>> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
>>       reclaim_state.reclaimed_slab = 0;
>>       p->reclaim_state = &reclaim_state;
>>
>> -     if (zone_page_state(zone, NR_FILE_PAGES) -
>> -             zone_page_state(zone, NR_FILE_MAPPED) >
>> -             zone->min_unmapped_pages) {
>> +     nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
>> +                              zone_page_state(zone, NR_ACTIVE_FILE) -
>> +                              zone_page_state(zone, NR_FILE_MAPPED);
>
> This can possibly go negative.

Is this a problem?
A negative value means almost all pages are mapped. Thus

  (nr_unmapped_file_pages > zone->min_unmapped_pages)  => 0

is ok, I think.

>
>> +     if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
>>               /*
>>                * Free memory by calling shrink zone with increasing
>>                * priorities until we have enough memory freed.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13  3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
  2009-05-13 14:47   ` Rik van Riel
  2009-05-13 15:22   ` Robin Holt
@ 2009-05-18  3:49   ` Wu Fengguang
  2009-05-19  1:16     ` Zhang, Yanmin
  2009-05-19  2:53     ` KOSAKI Motohiro
  2009-05-18  9:09   ` Wu Fengguang
  3 siblings, 2 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18  3:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter,
	Zhang, Yanmin

On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> Subject: [PATCH] zone_reclaim_mode is always 0 by default
> 
> Current Linux policy is that, if the machine has a large remote node distance,
> zone_reclaim_mode is enabled by default, because until recently we could assume
> that a large distance meant a large server.
> 
> Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
> memory controller. IOW, the machine is NUMA from the software's view.
> 
> Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
> fit desktops and small file servers. It causes a performance regression.

I can confirm this, Yanmin recently ran into exactly such a
regression, which was fixed by manually disabling the zone reclaim
mode. So I guess you can safely add an

Tested-by: "Zhang, Yanmin" <yanmin.zhang@intel.com>

> Thus, zone_reclaim == 0 is a better default. Sorry, HPC guys:
> you need to turn zone_reclaim_mode on manually now.
 
I guess the borderline will continue to blur up. It will be more
dependent on workloads instead of physical NUMA capabilities. So

Acked-by: Wu Fengguang <fengguang.wu@intel.com> 

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/page_alloc.c |    7 -------
>  1 file changed, 7 deletions(-)
> 
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>  		int distance = node_distance(local_node, node);
>  
>  		/*
> -		 * If another node is sufficiently far away then it is better
> -		 * to reclaim pages in a zone before going off node.
> -		 */
> -		if (distance > RECLAIM_DISTANCE)
> -			zone_reclaim_mode = 1;
> -
> -		/*
>  		 * We don't want to pressure a particular node.
>  		 * So adding penalty to the first node in same
>  		 * distance group to make it round-robin.
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
  2009-05-18  3:35     ` KOSAKI Motohiro
@ 2009-05-18  3:53       ` Wu Fengguang
  2009-05-19  1:11         ` KOSAKI Motohiro
  0 siblings, 1 reply; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18  3:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Mon, May 18, 2009 at 11:35:31AM +0800, KOSAKI Motohiro wrote:
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
> >>               .isolate_pages = isolate_pages_global,
> >>       };
> >>       unsigned long slab_reclaimable;
> >> +     long nr_unmapped_file_pages;
> >>
> >>       disable_swap_token();
> >>       cond_resched();
> >> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
> >>       reclaim_state.reclaimed_slab = 0;
> >>       p->reclaim_state = &reclaim_state;
> >>
> >> -     if (zone_page_state(zone, NR_FILE_PAGES) -
> >> -             zone_page_state(zone, NR_FILE_MAPPED) >
> >> -             zone->min_unmapped_pages) {
> >> +     nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> >> +                              zone_page_state(zone, NR_ACTIVE_FILE) -
> >> +                              zone_page_state(zone, NR_FILE_MAPPED);
> >
> > This can possibly go negative.
> 
> Is this a problem?
> A negative value means almost all pages are mapped. Thus
> 
>   (nr_unmapped_file_pages > zone->min_unmapped_pages)  => 0
> 
> is ok, I think.

I wonder why you didn't get a gcc warning, because zone->min_unmapped_pages
is an "unsigned long".

Anyway, add a simple note to the code if it works *implicitly*?

Thanks,
Fengguang

> >
> >> +     if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> >>               /*
> >>                * Free memory by calling shrink zone with increasing
> >>                * priorities until we have enough memory freed.
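
The concern about comparing a long with an unsigned long can be illustrated with a small userspace sketch. The helper name is hypothetical and the numbers are made up; the explicit cast only spells out the conversion that C's usual arithmetic conversions apply implicitly when a long meets an unsigned long. gcc reports such mixed comparisons under -Wsign-compare, which for C is enabled by -Wextra rather than -Wall, and that may explain the missing warning.

```c
#include <stdbool.h>

/*
 * Hypothetical helper modelling
 *     nr_unmapped_file_pages > zone->min_unmapped_pages
 * where the left side is a long and the right side an unsigned long.
 * The cast makes the implicit conversion visible: a negative long turns
 * into a very large unsigned long before the comparison happens.
 */
static bool sketch_exceeds_mixed(long nr_unmapped, unsigned long min_unmapped)
{
	return (unsigned long)nr_unmapped > min_unmapped;
}
```

So a negative nr_unmapped does not make the test evaluate to 0; for example, sketch_exceeds_mixed(-1, 1000) is true, and in the patch that would let reclaim proceed even though almost every file page is mapped.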

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-13  3:08 ` [PATCH 4/4] zone_reclaim_mode is always 0 by default KOSAKI Motohiro
                     ` (2 preceding siblings ...)
  2009-05-18  3:49   ` Wu Fengguang
@ 2009-05-18  9:09   ` Wu Fengguang
  3 siblings, 0 replies; 45+ messages in thread
From: Wu Fengguang @ 2009-05-18  9:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>  		int distance = node_distance(local_node, node);
>  
>  		/*
> -		 * If another node is sufficiently far away then it is better
> -		 * to reclaim pages in a zone before going off node.
> -		 */
> -		if (distance > RECLAIM_DISTANCE)
> -			zone_reclaim_mode = 1;
> -

Also remove the RECLAIM_DISTANCE definitions in
include/linux/topology.h and arch/ia64/include/asm/topology.h?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4] vmscan: change the number of the unmapped files in zone reclaim
  2009-05-18  3:53       ` Wu Fengguang
@ 2009-05-19  1:11         ` KOSAKI Motohiro
  0 siblings, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19  1:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

> On Mon, May 18, 2009 at 11:35:31AM +0800, KOSAKI Motohiro wrote:
> > >> --- a/mm/vmscan.c
> > >> +++ b/mm/vmscan.c
> > >> @@ -2397,6 +2397,7 @@ static int __zone_reclaim(struct zone *z
> > >>               .isolate_pages = isolate_pages_global,
> > >>       };
> > >>       unsigned long slab_reclaimable;
> > >> +     long nr_unmapped_file_pages;
> > >>
> > >>       disable_swap_token();
> > >>       cond_resched();
> > >> @@ -2409,9 +2410,11 @@ static int __zone_reclaim(struct zone *z
> > >>       reclaim_state.reclaimed_slab = 0;
> > >>       p->reclaim_state = &reclaim_state;
> > >>
> > >> -     if (zone_page_state(zone, NR_FILE_PAGES) -
> > >> -             zone_page_state(zone, NR_FILE_MAPPED) >
> > >> -             zone->min_unmapped_pages) {
> > >> +     nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > >> +                              zone_page_state(zone, NR_ACTIVE_FILE) -
> > >> +                              zone_page_state(zone, NR_FILE_MAPPED);
> > >
> > > This can possibly go negative.
> > 
> > Is this a problem?
> > A negative value means almost all pages are mapped. Thus
> > 
> >   (nr_unmapped_file_pages > zone->min_unmapped_pages)  => 0
> > 
> > is ok, I think.
> 
> I wonder why you didn't get a gcc warning, because zone->min_unmapped_pages
> is an "unsigned long".
> 
> Anyway, add a simple note to the code if it works *implicitly*?

Hm, is my gcc the wrong version? (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14))
Anyway, you are right. Thanks for the good catch :)

An incremental fix is below.

Patch name: vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim-fix.patch
Applied after: vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
---
 mm/vmscan.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2397,7 +2397,9 @@ static int __zone_reclaim(struct zone *z
 		.isolate_pages = isolate_pages_global,
 	};
 	unsigned long slab_reclaimable;
-	long nr_unmapped_file_pages;
+	unsigned long nr_file_pages;
+	unsigned long nr_mapped;
+	unsigned long nr_unmapped_file_pages = 0;
 
 	disable_swap_token();
 	cond_resched();
@@ -2410,9 +2412,11 @@ static int __zone_reclaim(struct zone *z
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
-				 zone_page_state(zone, NR_ACTIVE_FILE) -
-				 zone_page_state(zone, NR_FILE_MAPPED);
+	nr_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+			zone_page_state(zone, NR_ACTIVE_FILE);
+	nr_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	if (likely(nr_file_pages >= nr_mapped))
+		nr_unmapped_file_pages = nr_file_pages - nr_mapped;
 
 	if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
 		/*
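
The clamped subtraction used by this fix can be sketched in isolation. The helper below is hypothetical and takes plain numbers instead of reading zone counters; it is meant only to show the arithmetic, not to stand in for the kernel code.

```c
/*
 * Hypothetical helper: number of unmapped file pages, clamped at zero.
 * All operands are unsigned, so the subtraction is only performed when
 * it cannot wrap around.
 */
static unsigned long sketch_unmapped_file_pages(unsigned long inactive_file,
						unsigned long active_file,
						unsigned long mapped)
{
	unsigned long nr_file_pages = inactive_file + active_file;

	if (nr_file_pages >= mapped)
		return nr_file_pages - mapped;
	return 0;	/* more mapped pages than LRU file pages */
}
```

Because the result is an unsigned long that never wraps, the later comparison against zone->min_unmapped_pages compares like with like.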




^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-18  3:49   ` Wu Fengguang
@ 2009-05-19  1:16     ` Zhang, Yanmin
  2009-05-19  2:53     ` KOSAKI Motohiro
  1 sibling, 0 replies; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19  1:16 UTC (permalink / raw)
  To: Wu, Fengguang, KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

>>-----Original Message-----
>>From: Wu, Fengguang
>>Sent: May 18, 2009 11:49
>>To: KOSAKI Motohiro
>>Cc: LKML; linux-mm; Andrew Morton; Rik van Riel; Christoph Lameter; Zhang,
>>Yanmin
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>>
>>> Current Linux policy is that, if the machine has a large remote node distance,
>>> zone_reclaim_mode is enabled by default, because until recently we could assume
>>> that a large distance meant a large server.
>>>
>>> Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
>>> memory controller. IOW, the machine is NUMA from the software's view.
>>>
>>> Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
>>> fit desktops and small file servers. It causes a performance regression.
>>
>>I can confirm this, Yanmin recently ran into exactly such a
>>regression, which was fixed by manually disabling the zone reclaim
>>mode. So I guess you can safely add an
[YM] Fengguang is right. One Nehalem machine has 12GB of memory,
but there is always 2GB free even though applications access lots of files.
Eventually we traced the root cause to zone_reclaim_mode=1.

Acked.



>>
>>Tested-by: "Zhang, Yanmin" <yanmin.zhang@intel.com>
>>
>>> Thus, zone_reclaim == 0 is a better default. Sorry, HPC guys:
>>> you need to turn zone_reclaim_mode on manually now.
>>
>>I guess the borderline will continue to blur up. It will be more
>>dependent on workloads instead of physical NUMA capabilities. So
>>
>>Acked-by: Wu Fengguang <fengguang.wu@intel.com>
>>
>>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> Cc: Christoph Lameter <cl@linux-foundation.org>
>>> Cc: Rik van Riel <riel@redhat.com>
>>> ---
>>>  mm/page_alloc.c |    7 -------
>>>  1 file changed, 7 deletions(-)
>>>
>>> Index: b/mm/page_alloc.c
>>> ===================================================================
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>>>  		int distance = node_distance(local_node, node);
>>>
>>>  		/*
>>> -		 * If another node is sufficiently far away then it is better
>>> -		 * to reclaim pages in a zone before going off node.
>>> -		 */
>>> -		if (distance > RECLAIM_DISTANCE)
>>> -			zone_reclaim_mode = 1;
>>> -
>>> -		/*
>>>  		 * We don't want to pressure a particular node.
>>>  		 * So adding penalty to the first node in same
>>>  		 * distance group to make it round-robin.
>>>
>>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-18  3:49   ` Wu Fengguang
  2009-05-19  1:16     ` Zhang, Yanmin
@ 2009-05-19  2:53     ` KOSAKI Motohiro
  2009-05-19  2:57       ` KOSAKI Motohiro
  2009-05-19  3:38       ` Zhang, Yanmin
  1 sibling, 2 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter, Zhang, Yanmin

> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
> > 
> > Current Linux policy is that, if the machine has a large remote node distance,
> > zone_reclaim_mode is enabled by default, because until recently we could assume
> > that a large distance meant a large server.
> > 
> > Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
> > memory controller. IOW, the machine is NUMA from the software's view.
> > 
> > Some Core i7 machines have a large remote node distance, and zone_reclaim doesn't
> > fit desktops and small file servers. It causes a performance regression.
> 
> I can confirm this, Yanmin recently ran into exactly such a
> regression, which was fixed by manually disabling the zone reclaim
> mode. So I guess you can safely add an
> 
> Tested-by: "Zhang, Yanmin" <yanmin.zhang@intel.com>
> 
> > Thus, zone_reclaim == 0 is a better default. Sorry, HPC guys:
> > you need to turn zone_reclaim_mode on manually now.
>  
> I guess the borderline will continue to blur up. It will be more
> dependent on workloads instead of physical NUMA capabilities. So
> 
> Acked-by: Wu Fengguang <fengguang.wu@intel.com> 

OK, let me explain the zone reclaim design and its performance tendencies.

Firstly, we can roughly classify the Linux ecosystem as follows:
 - HPC
 - high-end server
 - volume server
 - desktop
 - embedded

The classes are distinguished mainly by their typical workload.

Secondly, zone_reclaim means "I dislike remote node access much more than
disk access".
That fits HPC workloads very well, because:
  - An HPC workload typically runs as many processes (or threads) as there
    are CPUs.
    IOW, the workload typically uses memory equally on each node.
  - An HPC workload is typically a CPU-bound job. CPU migration is rare.
  - An HPC workload is typically long-lived (possibly >1 year).
    IOW, a remote node allocation results in a _very_ _very_ large number of
    remote node accesses.
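
To make the trade-off above concrete, here is a toy break-even model in C.
It is only a sketch under assumed numbers: the function name, its parameters
and every latency value are hypothetical and are not taken from the kernel or
from any measurement in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not kernel code): zone reclaim keeps an
 * allocation on the local node at the price of dropping a local page
 * cache page, which may have to be re-read from disk later.
 *
 * accesses      - how often the task touches the newly allocated page
 * local_ns      - assumed latency of a local-node memory access
 * remote_ns     - assumed latency of a remote-node memory access
 * disk_read_ns  - assumed cost of re-reading the dropped page from disk
 */
bool zone_reclaim_pays_off(uint64_t accesses, uint64_t local_ns,
			   uint64_t remote_ns, uint64_t disk_read_ns)
{
	uint64_t remote_penalty_ns;

	/* If remote access is not slower, there is nothing to gain. */
	if (remote_ns <= local_ns)
		return false;

	/* Extra time spent if the page had been allocated remotely. */
	remote_penalty_ns = accesses * (remote_ns - local_ns);

	return remote_penalty_ns > disk_read_ns;
}
```

With assumed values of 100 ns local, 150 ns remote and 5 ms for the disk
read, reclaim only pays off above 100,000 accesses. A long-lived CPU-bound
job passes that easily; a thread that touches the page a few times and then
moves to another cpu does not.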

But zone_reclaim doesn't fit the typical server workload.
  - A server workload often uses a thread pool, and some threads sleep until
    a request is received.
    IOW, when a thread wakes up, it might move to another cpu.
    Node distance doesn't mean much for a workload with weak cpu locality.

Plus, the disk cache is the file server's identity. We shouldn't treat it as
unimportant.
Plus, DB software can consume almost all system memory, and (in general) RDB
data is harder to split equally across nodes than HPC data.

The desktop workload is special. Desktop people can run various workloads
beyond our assumptions, so we shouldn't assume anything about their workload.
However, AFAIK almost all desktop software uses memory as if it were UMA.

We don't need to care about embedded. It is typically UMA.


IOW, the benefit of zone reclaim depends on "strong cpu locality",
"the workload is cpu bound" and "threads are long lived".
But many workloads don't meet these requirements. IOW, zone reclaim is
a workload-dependent feature (as Wu said).


In general, a workload-dependent feature doesn't fit as a default option.
We can't know what workload the end user runs anyway.

Fortunately (or unfortunately), typical workload and machine size were
strongly correlated.
Thus, the current calculation of the default setting worked well in the past.

Now it is broken. What should we do?



Yanmin, we know 99% of Linux people use Intel cpus, and you are one of
the most persistent testers on lkml and you run many tests.
May I ask which machines and benchmarks you tested?

If workloads that favor zone_reclaim=0 far outnumber workloads that favor
zone_reclaim=1, we can drop our fears and would prioritize your opinion,
of course.

Thanks.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-15 10:51         ` Robin Holt
@ 2009-05-19  2:53           ` KOSAKI Motohiro
  2009-05-20 14:00             ` Robin Holt
  0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19  2:53 UTC (permalink / raw)
  To: Robin Holt
  Cc: kosaki.motohiro, Christoph Lameter, LKML, linux-mm,
	Andrew Morton, Rik van Riel

Hi

> > Current linux policy is, zone_reclaim_mode is enabled by default if the machine
> > has large remote node distance. it's because we could assume that large distance 
> > mean large server until recently.
> > 
> > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opeteron) have P2P transport
> > memory controller. IOW it's seen as NUMA from software view.
> > 
> > Some Core i7 machine has large remote node distance, but zone_reclaim don't
> > fit desktop and small file server. it cause performance degression.
> > 
> > Thus, zone_reclaim == 0 is better by default if the machine is small.
> 
> What if I had a node 0 with 32GB or 128GB of memory.  In that case,
> we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
> 128GB.  I would suggest that zone reclaim would perform normally and
> be beneficial.
> 
> You are unfairly classifying this as a size of machine problem when it is
> really a problem with the underlying zone reclaim code being triggered
> due to imbalanced node/zones, part of which is due to a single node
> having multiple zones and those multiple zones setting up the conditions
> for extremely agressive reclaim.  In other words, you are putting a
> bandage in place to hide a problem on your particular hardware.
> 
> Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
> Aren't 4 node Ci7 boxes soon to be readily available?  How are your apps
> different from my apps in that you are not impacted by node locality?
> Are you being too insensitive to node locality?  Conversely am I being
> too sensitive?
> 
> All that said, I would not stop this from going in.  I just think the
> selection criteria is rather random.  I think we know the condition we
> are trying to avoid which is a small Normal zone on one node and a larger
> Normal zone on another causing zone reclaim to be overly agressive.
> I don't know how to quantify "small" versus "large".  I would suggest
> that a node 0 with 16 or more GB should have zone reclaim on by default
> as well.  Can that be expressed in the selection criteria.

I posted my opinion in another mail. Please see it.








^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  2:53     ` KOSAKI Motohiro
@ 2009-05-19  2:57       ` KOSAKI Motohiro
  2009-05-19  3:38       ` Zhang, Yanmin
  1 sibling, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19  2:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Wu Fengguang, LKML, linux-mm, Andrew Morton,
	Rik van Riel, Christoph Lameter, Zhang, Yanmin

nit fix.

> In general, the feature of workload depended don't fit default option.
> we can't know end-user run what workload anyway.
> 
> Fortunately (or Unfortunately), typical workload and machine size had

typical workload and machine size and remote node distance

> significant mutuality.
> Thus, the current default setting calculation had worked well in past days.
> 
> Now, it was breaked. What should we do?
> 
> 
> 
> Yanmin, We know 99% linux people use intel cpu and you are one of
> most hard repeated testing guy in lkml and you have much test.
> May I ask your tested machine and benchmark? 
> 
> if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency workload,
>  we can drop our afraid and we would prioritize your opinion, of cource.
> 
> thanks.
> 
> 




^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  2:53     ` KOSAKI Motohiro
  2009-05-19  2:57       ` KOSAKI Motohiro
@ 2009-05-19  3:38       ` Zhang, Yanmin
  2009-05-19  4:30         ` KOSAKI Motohiro
  1 sibling, 1 reply; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19  3:38 UTC (permalink / raw)
  To: KOSAKI Motohiro, Wu, Fengguang
  Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Christoph Lameter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="gb2312", Size: 4231 bytes --]

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>Sent: May 19, 2009 10:54
>>To: Wu, Fengguang
>>Cc: kosaki.motohiro@jp.fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
>>Riel; Christoph Lameter; Zhang, Yanmin
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>> >
>>> > Current linux policy is, if the machine has large remote node distance,
>>> >  zone_reclaim_mode is enabled by default because we've be able to assume

>>
>>ok, I would explain zone reclaim design and performance tendency.
>>
>>Firstly, we can make classification of linux eco system, roughly.
>> - HPC
>> - high-end server
>> - volume server
>> - desktop
>> - embedded
>>
>>it is separated by typical workload mainly.
>>
>>Secondly, zone_reclaim mean "I strongly dislike remote node access than
>>disk access".
>>it is very fitting on HPC workload. it because
>>  - HPC workload typically make the number of the same as cpus of processess
>>(or thread).
>>    IOW, the workload typically use memory equally each node.
>>  - HPC workload is typically CPU bounded job. CPU migration is rare.
>>  - HPC workload is typically long lived. (possible >1 year)
>>    IOW, remote node allocation makes _very_ _very_ much remote node access.
>>
>>but zone_reclaim don't fit typical server workload.
>>  - server workload often make thread pool and some thread is sleeping until
>>    a request receved.
>>    IOW, when thread waking-up, the thread might move another cpu.
>>    node distance tendency don't make sense on weak cpu locality workload.
>>
>>Plus, disk-cache is the file-server's identity. we shouldn't think it's not
>>important.
>>Plus, DB software can consume almost system memory and (In general) RDB data
>>makes
>>harder to split equally as hpc.
>>
>>desktop workload is special. desktop peopole can run various workload beyond
>>our assumption. So, we shouldn't have any workload assumption to desktop
>>people.
>>However, AFAIK almost desktop software use memory as UMA.
>>
>>we don't need to care embedded. it is typically UMA.
>>
>>
>>IOW, the benefit of zone reclaim depend on "strong cpu locality" and
>>"workload is cpu bounded" and "thead is long lived".
>>but many workload don't fill above requirement. IOW, zone reclaim is
>>workload depended feature (as Wu said).
>>
>>
>>In general, the feature of workload depended don't fit default option.
>>we can't know end-user run what workload anyway.
>>
>>Fortunately (or Unfortunately), typical workload and machine size had
>>significant mutuality.
>>Thus, the current default setting calculation had worked well in past days.
[YM] Your analysis is clear and deep.

>>
>>Now, it was breaked. What should we do?
>>Yanmin, We know 99% linux people use intel cpu and you are one of
>>most hard repeated testing
[YM] It's very easy to reproduce them on my machines. :) Sometimes that is
because the issues only exist on machines with lots of cpus, and other
community developers have no such environments.

 guy in lkml and you have much test.
>>May I ask your tested machine and benchmark?
[YM] Usually I run lots of benchmark tests against the latest kernel, but
this issue was first reported by a customer. The customer runs apache
on Nehalem machines to access lots of files. So the issue is an example of
a file server workload.

BTW, I found that many fio test cases show a big drop after I upgraded the
BIOS of one Nehalem machine. Checking the vmstat data, I found that almost
half of the memory is always free. It's also related to zone_reclaim_mode,
because the new BIOS changes the node distance to a large value. I use
numactl --interleave=all to work around the problem temporarily.
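
For anyone who wants to check whether a machine is in the same situation, a
minimal user-space sketch in C follows. It assumes the usual procfs/sysfs
locations (/proc/sys/vm/zone_reclaim_mode and
/sys/devices/system/node/node0/distance). The helper names are made up for
illustration, and 20 is only an assumed value for the kernel's
RECLAIM_DISTANCE threshold, so please check it against your own tree.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed value of the kernel's RECLAIM_DISTANCE; check your tree. */
#define ASSUMED_RECLAIM_DISTANCE 20

/*
 * Parse one line of a sysfs "distance" file, e.g. "10 21", and return
 * the largest distance found, or -1 if the line holds no number.
 */
int max_node_distance(const char *line)
{
	int max = -1;
	char *end;

	for (;;) {
		long d = strtol(line, &end, 10);

		if (end == line)	/* no more numbers on this line */
			break;
		if (d > max)
			max = (int)d;
		line = end;
	}
	return max;
}

/* Read a single integer from a file; returns -1 on any error. */
int read_int_file(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Print the current mode and the largest distance seen from node 0. */
void report_zone_reclaim(void)
{
	char line[256];
	FILE *f = fopen("/sys/devices/system/node/node0/distance", "r");

	printf("zone_reclaim_mode = %d\n",
	       read_int_file("/proc/sys/vm/zone_reclaim_mode"));
	if (f && fgets(line, sizeof(line), f))
		printf("node0 max distance = %d (assumed threshold %d)\n",
		       max_node_distance(line), ASSUMED_RECLAIM_DISTANCE);
	if (f)
		fclose(f);
}
```

Calling report_zone_reclaim() from main() prints both values. Writing 0 to
/proc/sys/vm/zone_reclaim_mode as root (or running
sysctl -w vm.zone_reclaim_mode=0) disables zone reclaim at run time, which is
the manual fix mentioned earlier in this thread.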

I have no HPC environment.

>>
>>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency
>>workload,
>> we can drop our afraid and we would prioritize your opinion, of cource.
So it seems only file servers have the issue currently.

Yanmin


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  3:38       ` Zhang, Yanmin
@ 2009-05-19  4:30         ` KOSAKI Motohiro
  2009-05-19  5:06           ` Zhang, Yanmin
  0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19  4:30 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: kosaki.motohiro, Wu, Fengguang, LKML, linux-mm, Andrew Morton,
	Rik van Riel, Christoph Lameter

> >>-----Original Message-----
> >>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
> >>Sent: May 19, 2009 10:54
> >>To: Wu, Fengguang
> >>Cc: kosaki.motohiro@jp.fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
> >>Riel; Christoph Lameter; Zhang, Yanmin
> >>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
> >>
> >>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
> >>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
> >>> >
> >>> > Current linux policy is, if the machine has large remote node distance,
> >>> >  zone_reclaim_mode is enabled by default because we've be able to assume
> 
> >>
> >>ok, I would explain zone reclaim design and performance tendency.
> >>
> >>Firstly, we can make classification of linux eco system, roughly.
> >> - HPC
> >> - high-end server
> >> - volume server
> >> - desktop
> >> - embedded
> >>
> >>it is separated by typical workload mainly.
> >>
> >>Secondly, zone_reclaim mean "I strongly dislike remote node access than
> >>disk access".
> >>it is very fitting on HPC workload. it because
> >>  - HPC workload typically make the number of the same as cpus of processess
> >>(or thread).
> >>    IOW, the workload typically use memory equally each node.
> >>  - HPC workload is typically CPU bounded job. CPU migration is rare.
> >>  - HPC workload is typically long lived. (possible >1 year)
> >>    IOW, remote node allocation makes _very_ _very_ much remote node access.
> >>
> >>but zone_reclaim don't fit typical server workload.
> >>  - server workload often make thread pool and some thread is sleeping until
> >>    a request receved.
> >>    IOW, when thread waking-up, the thread might move another cpu.
> >>    node distance tendency don't make sense on weak cpu locality workload.
> >>
> >>Plus, disk-cache is the file-server's identity. we shouldn't think it's not
> >>important.
> >>Plus, DB software can consume almost system memory and (In general) RDB data
> >>makes
> >>harder to split equally as hpc.
> >>
> >>desktop workload is special. desktop peopole can run various workload beyond
> >>our assumption. So, we shouldn't have any workload assumption to desktop
> >>people.
> >>However, AFAIK almost desktop software use memory as UMA.
> >>
> >>we don't need to care embedded. it is typically UMA.
> >>
> >>
> >>IOW, the benefit of zone reclaim depend on "strong cpu locality" and
> >>"workload is cpu bounded" and "thead is long lived".
> >>but many workload don't fill above requirement. IOW, zone reclaim is
> >>workload depended feature (as Wu said).
> >>
> >>
> >>In general, the feature of workload depended don't fit default option.
> >>we can't know end-user run what workload anyway.
> >>
> >>Fortunately (or Unfortunately), typical workload and machine size had
> >>significant mutuality.
> >>Thus, the current default setting calculation had worked well in past days.
> [YM] Your analysis is clear and deep.

Thanks!


> >>Now, it was breaked. What should we do?
> >>Yanmin, We know 99% linux people use intel cpu and you are one of
> >>most hard repeated testing
> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because the 
> issues only exist on machines with lots of cpu while other community developers
> have no such environments. 
>
> 
>  guy in lkml and you have much test.
> >>May I ask your tested machine and benchmark?
> [YM] Usually I started lots of benchmark testing against the latest kernel, but 
> as for this issue, it's reported by a customer firstly. The customer runs apache
> on Nehalem machines to access lots of files. So the issue is an example of file 
> server.

Hmmm.
I'm surprised by this report. I didn't know about this problem. Oh..

Actually, I don't think apache is only a file server.
Apache is one of the killer applications on Linux. It runs in a very wide
range of organizations.
Do you think large machines don't run apache? I don't think so.



> BTW, I found many test cases of fio have big drop after I upgraded BIOS of one 
> Nehalem machine. By checking vmstat data, I found almost a half memory is always free. It's also related to zone_reclaim_mode because new BIOS changes the node
> distance to a large value. I use numactl --interleave=all to walkaround the problem temporarily.
> 
> I have no HPC environment.

Yeah, that's OK. Christoph and I have one. My worry is that some workload I
don't know about regresses.
So, may I assume you ran your benchmarks with zone reclaim both 0 and 1 and
you haven't seen a regression with zone reclaim disabled?
If so, that encourages me very much.

If disabling zone reclaim mode causes no regression, I'll push again to
remove the default zone reclaim mode completely.


> >>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency
> >>workload,
> >> we can drop our afraid and we would prioritize your opinion, of cource.
> So it seems only file servers have the issue currently.






^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  4:30         ` KOSAKI Motohiro
@ 2009-05-19  5:06           ` Zhang, Yanmin
  2009-05-19  7:09             ` KOSAKI Motohiro
  0 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19  5:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu, Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>Sent: May 19, 2009 12:31
>>To: Zhang, Yanmin
>>Cc: kosaki.motohiro@jp.fujitsu.com; Wu, Fengguang; LKML; linux-mm; Andrew
>>Morton; Rik van Riel; Christoph Lameter
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>> >>-----Original Message-----
>>> >>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>> >>Sent: May 19, 2009 10:54
>>> >>To: Wu, Fengguang
>>> >>Cc: kosaki.motohiro@jp.fujitsu.com; LKML; linux-mm; Andrew Morton; Rik van
>>> >>Riel; Christoph Lameter; Zhang, Yanmin
>>> >>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>> >>
>>> >>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> >>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>> >>> >
>>> >>> > Current linux policy is, if the machine has large remote node distance,
>>> >>> >  zone_reclaim_mode is enabled by default because we've be able to assume
>>> >>Fortunately (or Unfortunately), typical workload and machine size had
>>> >>significant mutuality.
>>> >>Thus, the current default setting calculation had worked well in past days.
>>> [YM] Your analysis is clear and deep.
>>
>>Thanks!
>>
>>
>>> >>Now, it was breaked. What should we do?
>>> >>Yanmin, We know 99% linux people use intel cpu and you are one of
>>> >>most hard repeated testing
>>> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because
>>the
>>> issues only exist on machines with lots of cpu while other community
>>developers
>>> have no such environments.
>>>
>>>
>>>  guy in lkml and you have much test.
>>> >>May I ask your tested machine and benchmark?
>>> [YM] Usually I started lots of benchmark testing against the latest kernel,
>>but
>>> as for this issue, it's reported by a customer firstly. The customer runs
>>apache
>>> on Nehalem machines to access lots of files. So the issue is an example of
>>file
>>> server.
>>
>>hmmm.
>>I'm surprised this report. I didn't know this problem. oh..
[YM] Did you run a file server workload on such a NUMA machine with
 zone_reclaim_mode=1? If all nodes have the same amount of memory, the
behavior is obvious.


>>
>>Actually, I don't think apache is only file server.
>>apache is one of killer application in linux. it run on very widely
>>organization.
[YM] I know that. Apache can support document, ecommerce, and lots of other
usage models. What I mean is that one of our customers hit it with their
workload.


>>you think large machine don't run apache? I don't think so.
>>
>>
>>
>>> BTW, I found many test cases of fio have big drop after I upgraded BIOS of
>>one
>>> Nehalem machine. By checking vmstat data, I found almost a half memory is
>>always free. It's also related to zone_reclaim_mode because new BIOS changes
>>the node
>>> distance to a large value. I use numactl --interleave=all to walkaround the
>>problem temporarily.
>>>
>>> I have no HPC environment.
>>
>>Yeah, that's ok. I and cristoph have. My worries is my unknown workload become
>>regression.
>>so, May I assume you run your benchmark both zonre reclaim 0 and 1 and you
>>haven't seen regression by non-zone reclaim mode?
[YM] What is non-zone reclaim mode? Is it zone_reclaim_mode=0?
I didn't test that intentionally. So far I have only confirmed that fio has a
big drop when zone_reclaim_mode=1. I might test it with other benchmarks on
2 Nehalem machines.


>>if so, it encourage very much to me.
>>
>>if zone reclaim mode disabling don't have regression, I'll pushing to
>>remove default zone reclaim mode completely again.
[YM] I run lots of benchmarks, but that doesn't mean I run all benchmarks.
In particular, I run no HPC ones.


>>
>>
>>> >>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency
>>> >>workload,
>>> >> we can drop our afraid and we would prioritize your opinion, of cource.
>>> So it seems only file servers have the issue currently.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  5:06           ` Zhang, Yanmin
@ 2009-05-19  7:09             ` KOSAKI Motohiro
  2009-05-19  7:15               ` Zhang, Yanmin
  0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-19  7:09 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: kosaki.motohiro, Wu, Fengguang, LKML, linux-mm, Andrew Morton,
	Rik van Riel, Christoph Lameter

Hi

> >>> >>Now, it was breaked. What should we do?
> >>> >>Yanmin, We know 99% linux people use intel cpu and you are one of
> >>> >>most hard repeated testing
> >>> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because
> >>the
> >>> issues only exist on machines with lots of cpu while other community
> >>developers
> >>> have no such environments.
> >>>
> >>>
> >>>  guy in lkml and you have much test.
> >>> >>May I ask your tested machine and benchmark?
> >>> [YM] Usually I started lots of benchmark testing against the latest kernel,
> >>but
> >>> as for this issue, it's reported by a customer firstly. The customer runs
> >>apache
> >>> on Nehalem machines to access lots of files. So the issue is an example of
> >>file
> >>> server.
> >>
> >>hmmm.
> >>I'm surprised this report. I didn't know this problem. oh..
> [YM] Did you run file server workload on such NUMA machine with
>  zone_reclaim_mode=1? If all nodes have the same memory, the behavior is
> obvious.

I missed your point. I agree the file server case is obvious, but I don't
think anybody opposes this.



> >>Actually, I don't think apache is only file server.
> >>apache is one of killer application in linux. it run on very widely
> >>organization.
> [YM] I know that. Apache could support document, ecommerce, and lots of other
> usage models. What I mean is one of customers hit it with their
> workload.

hmhm, ok.


> >>you think large machine don't run apache? I don't think so.
> >>
> >>
> >>
> >>> BTW, I found many test cases of fio have big drop after I upgraded BIOS of
> >>one
> >>> Nehalem machine. By checking vmstat data, I found almost a half memory is
> >>always free. It's also related to zone_reclaim_mode because new BIOS changes
> >>the node
> >>> distance to a large value. I use numactl --interleave=all to walkaround the
> >>problem temporarily.
> >>>
> >>> I have no HPC environment.
> >>
> >>Yeah, that's ok. I and cristoph have. My worries is my unknown workload become
> >>regression.
> >>so, May I assume you run your benchmark both zonre reclaim 0 and 1 and you
> >>haven't seen regression by non-zone reclaim mode?
> [YM] what is non-zone reclaim mode? When zone_reclaim_mode=0?
> I didn't do that intentionally. Currently I just make sure FIO has a big drop
>  when zone_reclaim_mode=1. I might test it with other benchmarks on 2 Nehalem machines.

May I ask what FIO is?
File IO?


> >>if so, it encourage very much to me.
> >>
> >>if zone reclaim mode disabling don't have regression, I'll pushing to
> >>remove default zone reclaim mode completely again.
> [YM] I run lots of benchmarks, but it doesn't mean I run all benchmarks, especially
> no HPC. 

Of course. Nobody can run every benchmark in the world :)



> >>> >>if zone_reclaim=0 tendency workload is much than zone_reclaim=1 tendency
> >>> >>workload,
> >>> >> we can drop our afraid and we would prioritize your opinion, of cource.
> >>> So it seems only file servers have the issue currently.
> 




^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  7:09             ` KOSAKI Motohiro
@ 2009-05-19  7:15               ` Zhang, Yanmin
  0 siblings, 0 replies; 45+ messages in thread
From: Zhang, Yanmin @ 2009-05-19  7:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu, Fengguang, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Christoph Lameter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="gb2312", Size: 1666 bytes --]

>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@jp.fujitsu.com]
>>Sent: May 19, 2009 15:10
>>To: Zhang, Yanmin
>>Cc: kosaki.motohiro@jp.fujitsu.com; Wu, Fengguang; LKML; linux-mm; Andrew
>>Morton; Rik van Riel; Christoph Lameter
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>Hi
>>
>>> >>> >>Now, it was breaked. What should we do?
>>> >>> >>Yanmin, We know 99% linux people use intel cpu and you are one of
>>> >>> >>most hard repeated testing
>>> >>> [YM] It's very easy to reproduce them on my machines. :) Sometimes, because
>>> >>the
>>> >>> issues only exist on machines with lots of cpu while other community
>>> >>developers
>>> >>> have no such environments.
>>> >>>
>>> >>>
>>> >>>  guy in lkml and you have much test.
>>> >>> >>May I ask your tested machine and benchmark?
>>> >>> [YM] Usually I started lots of benchmark testing against the latest
>>> >>
>>> >>Yeah, that's ok. I and cristoph have. My worries is my unknown workload
>>become
>>> >>regression.
>>> >>so, May I assume you run your benchmark both zonre reclaim 0 and 1 and you
>>> >>haven't seen regression by non-zone reclaim mode?
>>> [YM] what is non-zone reclaim mode? When zone_reclaim_mode=0?
>>> I didn't do that intentionally. Currently I just make sure FIO has a big drop
>>>  when zone_reclaim_mode=1. I might test it with other benchmarks on 2 Nehalem
>>machines.
>>

>>May I ask what is FIO?
>>File IO?
[YM] fio is a tool to test I/O. Jens Axboe is the author.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-19  2:53           ` KOSAKI Motohiro
@ 2009-05-20 14:00             ` Robin Holt
  2009-05-21  2:44               ` KOSAKI Motohiro
  0 siblings, 1 reply; 45+ messages in thread
From: Robin Holt @ 2009-05-20 14:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Robin Holt, Christoph Lameter, LKML, linux-mm, Andrew Morton,
	Rik van Riel

On Tue, May 19, 2009 at 11:53:44AM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > > Current linux policy is, zone_reclaim_mode is enabled by default if the machine
> > > has large remote node distance. it's because we could assume that large distance 
> > > mean large server until recently.
> > > 
> > > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opeteron) have P2P transport
> > > memory controller. IOW it's seen as NUMA from software view.
> > > 
> > > Some Core i7 machine has large remote node distance, but zone_reclaim don't
> > > fit desktop and small file server. it cause performance degression.
> > > 
> > > Thus, zone_reclaim == 0 is better by default if the machine is small.
> > 
> > What if I had a node 0 with 32GB or 128GB of memory.  In that case,
> > we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
> > 128GB.  I would suggest that zone reclaim would perform normally and
> > be beneficial.
> > 
> > You are unfairly classifying this as a size of machine problem when it is
> > really a problem with the underlying zone reclaim code being triggered
> > due to imbalanced node/zones, part of which is due to a single node
> > having multiple zones and those multiple zones setting up the conditions
> > for extremely agressive reclaim.  In other words, you are putting a
> > bandage in place to hide a problem on your particular hardware.
> > 
> > Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
> > Aren't 4 node Ci7 boxes soon to be readily available?  How are your apps
> > different from my apps in that you are not impacted by node locality?
> > Are you being too insensitive to node locality?  Conversely am I being
> > too sensitive?
> > 
> > All that said, I would not stop this from going in.  I just think the
> > selection criteria is rather random.  I think we know the condition we
> > are trying to avoid which is a small Normal zone on one node and a larger
> > Normal zone on another causing zone reclaim to be overly agressive.
> > I don't know how to quantify "small" versus "large".  I would suggest
> > that a node 0 with 16 or more GB should have zone reclaim on by default
> > as well.  Can that be expressed in the selection criteria.
> 
> I post my opinion as another mail. please see it.

I don't think you addressed my actual question.  How much of this is
a result of having a node where 1/4 of the memory is in the 'Normal'
zone and 3/4 is in the DMA32 zone?  How much is due to the imbalance
between Node 0 'Normal' and Node 1 'Normal'?  Shouldn't that type of
sanity check be used for turning on zone reclaim instead of some random
number of nodes?  Even with 128 nodes and 256 cpus, I _NEVER_ see the
system swapping out before allocating off node so I can certainly not
reproduce the situation you are seeing.

The imbalance I have seen was when I had two small memory nodes and two
large memory nodes and then oversubscribed memory.  In that situation,
I noticed that the apps on the small memory nodes were more frequently
impacted.  This unfairness made sense to me and seemed perfectly
reasonable.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-20 14:00             ` Robin Holt
@ 2009-05-21  2:44               ` KOSAKI Motohiro
  2009-05-21 13:31                 ` Christoph Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-21  2:44 UTC (permalink / raw)
  To: Robin Holt
  Cc: kosaki.motohiro, Christoph Lameter, LKML, linux-mm,
	Andrew Morton, Rik van Riel

> On Tue, May 19, 2009 at 11:53:44AM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > > > Current linux policy is, zone_reclaim_mode is enabled by default if the machine
> > > > has large remote node distance. it's because we could assume that large distance 
> > > > mean large server until recently.
> > > > 
> > > > Unfortunately, recent modern x86 CPU (e.g. Core i7, Opeteron) have P2P transport
> > > > memory controller. IOW it's seen as NUMA from software view.
> > > > 
> > > > Some Core i7 machine has large remote node distance, but zone_reclaim don't
> > > > fit desktop and small file server. it cause performance degression.
> > > > 
> > > > Thus, zone_reclaim == 0 is better by default if the machine is small.
> > > 
> > > What if I had a node 0 with 32GB or 128GB of memory.  In that case,
> > > we would have 3GB for DMA32, 125GB for Normal and then a node 1 with
> > > 128GB.  I would suggest that zone reclaim would perform normally and
> > > be beneficial.
> > > 
> > > You are unfairly classifying this as a size of machine problem when it is
> > > really a problem with the underlying zone reclaim code being triggered
> > > due to imbalanced node/zones, part of which is due to a single node
> > > having multiple zones and those multiple zones setting up the conditions
> > > for extremely aggressive reclaim.  In other words, you are putting a
> > > bandage in place to hide a problem on your particular hardware.
> > > 
> > > Can RECLAIM_DISTANCE be adjusted so your Ci7 boxes are no longer caught?
> > > Aren't 4 node Ci7 boxes soon to be readily available?  How are your apps
> > > different from my apps in that you are not impacted by node locality?
> > > Are you being too insensitive to node locality?  Conversely am I being
> > > too sensitive?
> > > 
> > > All that said, I would not stop this from going in.  I just think the
> > > selection criteria is rather random.  I think we know the condition we
> > > are trying to avoid which is a small Normal zone on one node and a larger
> > > Normal zone on another causing zone reclaim to be overly aggressive.
> > > I don't know how to quantify "small" versus "large".  I would suggest
> > > that a node 0 with 16 or more GB should have zone reclaim on by default
> > > as well.  Can that be expressed in the selection criteria?
> > 
> > I post my opinion as another mail. please see it.
> 
> I don't think you addressed my actual question.  How much of this is
> a result of having a node where 1/4 of the memory is in the 'Normal'
> zone and 3/4 is in the DMA32 zone?  How much is due to the imbalance
> between Node 0 'Normal' and Node 1 'Normal'?  Shouldn't that type of
> sanity check be used for turning on zone reclaim instead of some random
> number of nodes?

I can't follow your message. Can you post your patch?
Can you explain your sanity check?

For now, I have decided to remove the "nr_online_nodes >= 4" condition.
The Apache regression is really nonsense.

> Even with 128 nodes and 256 cpus, I _NEVER_ see the
> system swapping out before allocating off node so I can certainly not
> reproduce the situation you are seeing.

Hmm, but I don't think we can assume an HPC workload.


> 
> The imbalance I have seen was when I had two small memory nodes and two
> large memory nodes and then oversubscribed memory.  In that situation,
> I noticed that the apps on the small memory nodes were more frequently
> impacted.  This unfairness made sense to me and seemed perfectly
> reasonable.


Node imbalance is OK. For example, typical Linux init scripts start many daemon processes
on node 0; we can't avoid that, and it doesn't cause strange behavior.

But zone imbalance is bad. I don't want to discuss every item again, but you
can google the inter-zone reclaim issue instead.





* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-21  2:44               ` KOSAKI Motohiro
@ 2009-05-21 13:31                 ` Christoph Lameter
  2009-05-21 13:57                   ` Robin Holt
  2009-05-24 13:44                   ` KOSAKI Motohiro
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Lameter @ 2009-05-21 13:31 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

On Thu, 21 May 2009, KOSAKI Motohiro wrote:

> I can't follow your message. Can you post your patch?
> Can you explain your sanity check?
>
> For now, I have decided to remove the "nr_online_nodes >= 4" condition.
> The Apache regression is really nonsense.

Not sure what that means? Apache regresses with zone reclaim? My
measurements when we introduced zone reclaim showed just the opposite
because Apache would get node local memory and thus run faster. You can
screw this up of course if you load the system so high that the apache
processes are tossed around by the scheduler. Then the node local
allocation may be worse than round robin because all the pages allocated
by a process are now on one node; if the scheduler moves the
process to a remote node, then all accesses are penalized.

> > Even with 128 nodes and 256 cpus, I _NEVER_ see the
> > system swapping out before allocating off node so I can certainly not
> > reproduce the situation you are seeing.
>
> Hmm, but I don't think we can assume an HPC workload.

System swapping due to zone reclaim? zone reclaim only reclaims unmapped
pages so it will not swap. Maybe some bug crept in in the recent changes?
Or you overrode the defaults for zone reclaim?


* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-21 13:31                 ` Christoph Lameter
@ 2009-05-21 13:57                   ` Robin Holt
  2009-05-24 13:44                   ` KOSAKI Motohiro
  1 sibling, 0 replies; 45+ messages in thread
From: Robin Holt @ 2009-05-21 13:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel

On Thu, May 21, 2009 at 09:31:08AM -0400, Christoph Lameter wrote:
> On Thu, 21 May 2009, KOSAKI Motohiro wrote:
> 
> > I can't follow your message. Can you post your patch?
> > Can you explain your sanity check?
> >
> > For now, I have decided to remove the "nr_online_nodes >= 4" condition.
> > The Apache regression is really nonsense.
> 
> Not sure what that means? Apache regresses with zone reclaim? My
> measurements when we introduced zone reclaim showed just the opposite
> because Apache would get node local memory and thus run faster. You can
> screw this up of course if you load the system so high that the apache
> processes are tossed around by the scheduler. Then the node local
> allocation may be worse than round robin because all the pages allocated
> by a process are now on one node; if the scheduler moves the
> process to a remote node, then all accesses are penalized.

I think the point Kosaki is trying to make is that reclaim happens really
aggressively for processes on node 0 versus node 1.  Maybe I am clinging
too strongly to one of the earlier posts, but that is what I read between
the lines.

That frequent reclaim is impacting allocations when he would rather they
skip the reclaim and go off node.  Again, it sounds like he prefers tuning
the default to what works best for him.  I don't too strongly disagree,
as long as the default isn't being changed capriciously.

I have always expected that NUMA boxes had reasons for preferring node
locality.  Maybe I misunderstand.  Maybe Ci7 is special and does not
have any impact for off socket references.  I would be surprised by that
after reading the literature, but I have not tested latency or bandwidth
on one so I can not say.

Personally, it sounds like if I had a box configured as his is, I would
use a cpuset to restrict most memory hungry things from using cpus
on node 0 and leave that as the small 'junk processes' cpu.  Maybe even
restrict things like cron etc to that corner of the system.

Thanks,
Robin


* Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
  2009-05-21 13:31                 ` Christoph Lameter
  2009-05-21 13:57                   ` Robin Holt
@ 2009-05-24 13:44                   ` KOSAKI Motohiro
  1 sibling, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-05-24 13:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Robin Holt, LKML, linux-mm, Andrew Morton, Rik van Riel


Sorry, I missed this mail.

> > > Even with 128 nodes and 256 cpus, I _NEVER_ see the
> > > system swapping out before allocating off node so I can certainly not
> > > reproduce the situation you are seeing.
> >
> > Hmm, but I don't think we can assume an HPC workload.
> 
> System swapping due to zone reclaim? zone reclaim only reclaims unmapped
> pages so it will not swap. Maybe some bug crept in in the recent changes?
> Or you overrode the defaults for zone reclaim?

I guess he uses zone_reclaim_mode=7 or similar.

However, I have to explain the recent zone reclaim changes. Current zone reclaim:

 1. zone reclaim can do high-order reclaim (by hanns)
 2. determines file-backed pages by get_scan_ratio

This means a high-order allocation triggers lumpy zone reclaim, and shrink_inactive_list()
doesn't honor may_swap. So zone_reclaim_mode=1 can cause swap-out if your
driver makes a high-order allocation request.
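
For readers unfamiliar with the sysctl: zone_reclaim_mode is a bitmask, so
the value 7 mentioned above sets all three documented bits, while the value 1
sets only the first. The userspace sketch below decodes a mode value. The
bit values follow the description in Documentation/sysctl/vm.txt; the helper
names are hypothetical and this is not kernel code. It shows only what each
bit is meant to request, not the behavior described above where mode 1 can
still end up swapping.

```c
#include <stdio.h>

/* Bit values as documented for the vm.zone_reclaim_mode sysctl. */
#define RECLAIM_ZONE	(1 << 0)	/* zone reclaim is enabled */
#define RECLAIM_WRITE	(1 << 1)	/* zone reclaim may write out dirty pages */
#define RECLAIM_SWAP	(1 << 2)	/* zone reclaim may swap pages */

/* Hypothetical helper: does this mode explicitly request swapping? */
static int mode_requests_swap(int mode)
{
	return (mode & RECLAIM_SWAP) != 0;
}

/* Hypothetical helper: write a one-line summary of the bits into buf. */
static const char *describe_zone_reclaim_mode(int mode, char *buf, size_t len)
{
	snprintf(buf, len, "zone=%d write=%d swap=%d",
		 !!(mode & RECLAIM_ZONE),
		 !!(mode & RECLAIM_WRITE),
		 !!(mode & RECLAIM_SWAP));
	return buf;
}
```

Calling describe_zone_reclaim_mode(7, buf, sizeof(buf)) fills buf with
"zone=1 write=1 swap=1", whereas mode 1 gives "zone=1 write=0 swap=0".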





