linux-kernel.vger.kernel.org archive mirror
* [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
@ 2011-04-11  8:19 KOSAKI Motohiro
  2011-04-11 21:19 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2011-04-11  8:19 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro

Recently, Robert Mueller reported zone_reclaim_mode doesn't work
properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
He is using Cyrus IMAPd and it's built on a very traditional
single-process model.

  * a master process which reads config files and manages the other
    processes
  * multiple imapd processes, one per connection
  * multiple pop3d processes, one per connection
  * multiple lmtpd processes, one per connection
  * periodic "cleanup" processes.

As a result, there are thousands of independent processes. The problem
is that recent Intel motherboards turn on zone_reclaim_mode by default,
and traditional prefork-model software doesn't work well with it.
Unfortunately, this model is still typical even in the 21st century.
We can't ignore it.

This patch raises the zone_reclaim_mode threshold to 30. The value 30
has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
such relatively cheap 2-4 socket machines are often used for
traditional servers like the one above. The intention is that these
machines don't use zone_reclaim_mode.

Note: ia64 and Power have arch-specific RECLAIM_DISTANCE definitions,
so this patch doesn't change the behavior of such high-end NUMA
machines.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/topology.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index b91a40e..fc839bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
  * (in whatever arch specific measurement units returned by node_distance())
  * then switch on zone reclaim on boot.
  */
-#define RECLAIM_DISTANCE 20
+#define RECLAIM_DISTANCE 30
 #endif
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
-- 
1.7.3.1
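
For readers following along, the effect of the one-line change can be
sketched in userspace. This is only an illustration under stated
assumptions: zone_reclaim_enabled() is a hypothetical helper that mirrors
the "distance > RECLAIM_DISTANCE" test in build_zonelists(), and the
distances are example SLIT-style values, not measurements from any machine
in this thread.

```c
#include <stdio.h>

/* Old and new generic thresholds from include/linux/topology.h. */
#define RECLAIM_DISTANCE_OLD 20
#define RECLAIM_DISTANCE_NEW 30

/*
 * Hypothetical helper (not a kernel function): returns 1 if a node at
 * the given distance would cause zone reclaim to be switched on at boot.
 */
static int zone_reclaim_enabled(int distance, int threshold)
{
	return distance > threshold;
}

/* Print the decision for a few example node distances. */
static void print_reclaim_table(void)
{
	static const int distances[] = { 10, 20, 21, 30, 31 };
	unsigned int i;

	for (i = 0; i < sizeof(distances) / sizeof(distances[0]); i++)
		printf("distance %2d: old=%d new=%d\n", distances[i],
		       zone_reclaim_enabled(distances[i], RECLAIM_DISTANCE_OLD),
		       zone_reclaim_enabled(distances[i], RECLAIM_DISTANCE_NEW));
}
```

Calling print_reclaim_table() from main() lists, for each example
distance, whether zone reclaim would be switched on under the old and the
new threshold. Distances between 21 and 30 are the ones whose default
changes.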





* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-11  8:19 [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30 KOSAKI Motohiro
@ 2011-04-11 21:19 ` Andrew Morton
  2011-04-12  0:59   ` KOSAKI Motohiro
  2011-04-11 21:29 ` Dave Hansen
  2011-04-13  0:16 ` David Rientjes
  2 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2011-04-11 21:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki

On Mon, 11 Apr 2011 17:19:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Recently, Robert Mueller reported zone_reclaim_mode doesn't work

It's time for some nagging.  

I'm trying to work out what the user-visible effect of this problem
was, but it isn't described in the changelog and there is no link to
any report and not even a Reported-by: or a Cc: and a search for Robert
in linux-mm and linux-kernel turned up blank.

> properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
> He is using Cyrus IMAPd and it's built on a very traditional
> single-process model.
> 
>   * a master process which reads config files and manages the other
>     processes
>   * multiple imapd processes, one per connection
>   * multiple pop3d processes, one per connection
>   * multiple lmtpd processes, one per connection
>   * periodic "cleanup" processes.
> 
> As a result, there are thousands of independent processes. The problem
> is that recent Intel motherboards turn on zone_reclaim_mode by default,
> and traditional prefork-model software doesn't work well with it.
> Unfortunately, this model is still typical even in the 21st century.
> We can't ignore it.
> 
> This patch raises the zone_reclaim_mode threshold to 30. The value 30
> has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> such relatively cheap 2-4 socket machines are often used for
> traditional servers like the one above. The intention is that these
> machines don't use zone_reclaim_mode.
> 
> Note: ia64 and Power have arch-specific RECLAIM_DISTANCE definitions,
> so this patch doesn't change the behavior of such high-end NUMA
> machines.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/topology.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
>   * (in whatever arch specific measurement units returned by node_distance())
>   * then switch on zone reclaim on boot.
>   */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30

Any time we tweak a magic number to improve one platform, we risk
causing deterioration on another.  Do we know that this risk is low
with this patch?

Also, what are we doing setting

	zone_reclaim_mode = 1;

when we have nice enumerated constants for this?  It should be

	zone_reclaim_mode = RECLAIM_ZONE;

or, pedantically but clearer:

	zone_reclaim_mode = RECLAIM_ZONE & ~RECLAIM_WRITE & ~RECLAIM_SWAP;
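
A small sketch of the mask arithmetic behind the pedantic form: clearing a
flag from a mask needs the bitwise complement (~), because the logical
negation (!) of any nonzero flag is 0 and would zero the whole expression.
The flag values are the ones defined in mm/vmscan.c; the two helper
functions are hypothetical and exist only for this illustration.

```c
#define RECLAIM_OFF   0
#define RECLAIM_ZONE  (1 << 0)	/* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1 << 1)	/* Writeout pages during reclaim */
#define RECLAIM_SWAP  (1 << 2)	/* Swap pages out during reclaim */

/* Logical NOT: !RECLAIM_WRITE is 0, so the result is RECLAIM_OFF. */
static int mask_with_logical_not(void)
{
	return RECLAIM_ZONE & !RECLAIM_WRITE & !RECLAIM_SWAP;
}

/* Bitwise NOT: only the WRITE and SWAP bits are cleared. */
static int mask_with_bitwise_not(void)
{
	return RECLAIM_ZONE & ~RECLAIM_WRITE & ~RECLAIM_SWAP;
}
```

Only the second helper yields RECLAIM_ZONE, which is the value the
build_zonelists() assignment is meant to have.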



Finally, we shouldn't be playing these guessing games in the kernel at
all - we'll always get it wrong for some platforms and for some
workloads.  zone_reclaim_mode is tunable at runtime and we should be
encouraging administrators, integrators and distros to *use* this
ability.  That might mean having to write some tools to empirically
determine the optimum setting for a particular machine.



* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-11  8:19 [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30 KOSAKI Motohiro
  2011-04-11 21:19 ` Andrew Morton
@ 2011-04-11 21:29 ` Dave Hansen
  2011-04-12  1:01   ` KOSAKI Motohiro
  2011-04-13  0:22   ` David Rientjes
  2011-04-13  0:16 ` David Rientjes
  2 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2011-04-11 21:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki, Chris McDermott

On Mon, 2011-04-11 at 17:19 +0900, KOSAKI Motohiro wrote:
> This patch raises the zone_reclaim_mode threshold to 30. The value 30
> has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> such relatively cheap 2-4 socket machines are often used for
> traditional servers like the one above. The intention is that these
> machines don't use zone_reclaim_mode.

I know specifically of pieces of x86 hardware that set the information
in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
behavior which that implies.

They've done performance testing and run very large and scary benchmarks
to make sure that they _want_ this turned on.  What this means for them
is that they'll probably be de-optimized, at least on newer versions of
the kernel.

If you want to do this for particular systems, maybe _that_'s what we
should do.  Have a list of specific configurations that need the
defaults overridden either because they're buggy, or they have an
unusual hardware configuration not really reflected in the distance
table.
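
As a rough illustration of the "list of specific configurations" idea, the
sketch below keeps a table of system identifiers and the zone_reclaim_mode
default each one wants. Everything in it is hypothetical: the vendor and
product strings are placeholders, and reclaim_default_for() is not a kernel
interface (the kernel's own DMI matching helpers are deliberately not used
here).

```c
#include <string.h>

#define RECLAIM_OFF  0
#define RECLAIM_ZONE (1 << 0)

/* One override entry: which system, and which default it should get. */
struct reclaim_quirk {
	const char *vendor;	/* placeholder for a DMI vendor string */
	const char *product;	/* placeholder for a DMI product string */
	int mode;		/* zone_reclaim_mode default for this system */
};

/* Hypothetical entries; real ones would come from tested hardware. */
static const struct reclaim_quirk reclaim_quirks[] = {
	{ "ExampleVendor", "BigNUMA 9000", RECLAIM_ZONE },
	{ "ExampleVendor", "MailBox 2S",   RECLAIM_OFF  },
};

/*
 * Return the override for a known system, or the generic default
 * (as derived from the distance table) when no entry matches.
 */
static int reclaim_default_for(const char *vendor, const char *product,
			       int generic_default)
{
	size_t i;

	for (i = 0; i < sizeof(reclaim_quirks) / sizeof(reclaim_quirks[0]); i++) {
		if (strcmp(reclaim_quirks[i].vendor, vendor) == 0 &&
		    strcmp(reclaim_quirks[i].product, product) == 0)
			return reclaim_quirks[i].mode;
	}
	return generic_default;
}
```

The design point is that the generic threshold stays simple, while systems
known to want different behavior are named explicitly.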

-- Dave



* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-11 21:19 ` Andrew Morton
@ 2011-04-12  0:59   ` KOSAKI Motohiro
  0 siblings, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2011-04-12  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, LKML, linux-mm, Christoph Lameter,
	David Rientjes, KAMEZAWA Hiroyuki

> On Mon, 11 Apr 2011 17:19:31 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Recently, Robert Mueller reported zone_reclaim_mode doesn't work
> 
> It's time for some nagging.  
> 
> I'm trying to work out what the user-visible effect of this problem
> was, but it isn't described in the changelog and there is no link to
> any report and not even a Reported-by: or a Cc: and a search for Robert
> in linux-mm and linux-kernel turned up blank.

Here.
	http://lkml.org/lkml/2010/9/12/236


> 
> > properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
> > He is using Cyrus IMAPd and it's built on a very traditional
> > single-process model.
> > 
> >   * a master process which reads config files and manages the other
> >     processes
> >   * multiple imapd processes, one per connection
> >   * multiple pop3d processes, one per connection
> >   * multiple lmtpd processes, one per connection
> >   * periodic "cleanup" processes.
> > 
> > As a result, there are thousands of independent processes. The problem
> > is that recent Intel motherboards turn on zone_reclaim_mode by default,
> > and traditional prefork-model software doesn't work well with it.
> > Unfortunately, this model is still typical even in the 21st century.
> > We can't ignore it.
> > 
> > This patch raises the zone_reclaim_mode threshold to 30. The value 30
> > has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> > such relatively cheap 2-4 socket machines are often used for
> > traditional servers like the one above. The intention is that these
> > machines don't use zone_reclaim_mode.
> > 
> > Note: ia64 and Power have arch-specific RECLAIM_DISTANCE definitions,
> > so this patch doesn't change the behavior of such high-end NUMA
> > machines.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Acked-by: Christoph Lameter <cl@linux.com>
> > Acked-by: David Rientjes <rientjes@google.com>
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/topology.h |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index b91a40e..fc839bf 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
> >   * (in whatever arch specific measurement units returned by node_distance())
> >   * then switch on zone reclaim on boot.
> >   */
> > -#define RECLAIM_DISTANCE 20
> > +#define RECLAIM_DISTANCE 30
> 
> Any time we tweak a magic number to improve one platform, we risk
> causing deterioration on another.  Do we know that this risk is low
> with this patch?

In the last thread, Robert Mueller, the bug reporter, explained that he is
using only commodity whitebox hardware and a very common workload.
Therefore, we agreed that the benefit outweighs the downside. IOW, whitebox
machines are used far more widely than special-purpose ones.



> Also, what are we doing setting
> 
> 	zone_reclaim_mode = 1;
> 
> when we have nice enumerated constants for this?  It should be
> 
> 	zone_reclaim_mode = RECLAIM_ZONE;
> 
> or, pedantically but clearer:
> 
> 	zone_reclaim_mode = RECLAIM_ZONE & ~RECLAIM_WRITE & ~RECLAIM_SWAP;

Indeed.



From 0298eb3256bd17eb88584a90917be749bd8d2c98 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Tue, 12 Apr 2011 09:40:38 +0900
Subject: [PATCH 2/2] mm: Don't use hardcoded constant for zone_reclaim_mode

zone_reclaim_mode was introduced by commit 9eeff2395e3
(Zone reclaim: Reclaim logic). At that time, it was a 0/1 boolean
variable.

Later, commit 1b2ffb7896 (Zone reclaim: Allow modification of zone reclaim
behavior) changed it to a bitmask, but page_alloc.c still uses it as a
boolean, which is slightly harder to read.

Let's convert it.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/swap.h |    5 +++++
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |    5 -----
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 384eb5f..078ba25 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -266,6 +266,11 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
 #ifdef CONFIG_NUMA
+#define RECLAIM_OFF 0
+#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
+
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e400779..be8607e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2982,7 +2982,7 @@ static void build_zonelists(pg_data_t *pgdat)
 		 * to reclaim pages in a zone before going off node.
 		 */
 		if (distance > RECLAIM_DISTANCE)
-			zone_reclaim_mode = 1;
+			zone_reclaim_mode = RECLAIM_ZONE;
 
 		/*
 		 * We don't want to pressure a particular node.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0c5a3d6..019e00c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2893,11 +2893,6 @@ module_init(kswapd_init)
  */
 int zone_reclaim_mode __read_mostly;
 
-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
-
 /*
  * Priority for ZONE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
-- 
1.7.3.1





* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-11 21:29 ` Dave Hansen
@ 2011-04-12  1:01   ` KOSAKI Motohiro
  2011-04-12  2:27     ` Dave Hansen
  2011-04-13  0:22   ` David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2011-04-12  1:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton,
	Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki,
	Chris McDermott

> On Mon, 2011-04-11 at 17:19 +0900, KOSAKI Motohiro wrote:
> > This patch raises the zone_reclaim_mode threshold to 30. The value 30
> > has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> > such relatively cheap 2-4 socket machines are often used for
> > traditional servers like the one above. The intention is that these
> > machines don't use zone_reclaim_mode.
> 
> I know specifically of pieces of x86 hardware that set the information
> in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> behavior which that implies.

Which hardware?
The reason we decided to change the default now is that the original bug
reporter was using plain commodity whitebox hardware and a very common
workload. If the hardware is common enough, we have to care about it; if
it is special, we don't. The hardware vendor should fix the firmware.


> They've done performance testing and run very large and scary benchmarks
> to make sure that they _want_ this turned on.  What this means for them
> is that they'll probably be de-optimized, at least on newer versions of
> the kernel.
> 
> If you want to do this for particular systems, maybe _that_'s what we
> should do.  Have a list of specific configurations that need the
> defaults overridden either because they're buggy, or they have an
> unusual hardware configuration not really reflected in the distance
> table.

No. It's not my demand. It's a demand from commodity hardware. You can fix
your company's firmware, but we can't change commodity firmware.





* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-12  1:01   ` KOSAKI Motohiro
@ 2011-04-12  2:27     ` Dave Hansen
  2011-04-12  7:25       ` KOSAKI Motohiro
  2011-05-24 20:07       ` Andrew Morton
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2011-04-12  2:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki, Chris McDermott

On Tue, 2011-04-12 at 10:01 +0900, KOSAKI Motohiro wrote:
> > On Mon, 2011-04-11 at 17:19 +0900, KOSAKI Motohiro wrote:
> > > This patch raises the zone_reclaim_mode threshold to 30. The value 30
> > > has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> > > such relatively cheap 2-4 socket machines are often used for
> > > traditional servers like the one above. The intention is that these
> > > machines don't use zone_reclaim_mode.
> > 
> > I know specifically of pieces of x86 hardware that set the information
> > in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> > behavior which that implies.
> 
> Which hardware?

I'd have to go digging for the model numbers.  I just remember having
discussions with folks about it a couple of years ago.  My memory isn't
what it used to be. :)

> The reason we decided to change the default now is that the original bug
> reporter was using plain commodity whitebox hardware and a very common
> workload. If the hardware is common enough, we have to care about it; if
> it is special, we don't. The hardware vendor should fix the firmware.

Yeah, it's certainly a "simple" fix.  The distance tables can certainly
be adjusted easily, and worked around pretty trivially with boot
options.  If we decide to change the generic case, let's also make sure
that we put something else in place simultaneously that is nice for the
folks that don't want it changed.  Maybe something DMI-based that digs
for model numbers?

I'll go try and dig for some more specifics on the hardware so we at
least have something to test on.

-- Dave



* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-12  2:27     ` Dave Hansen
@ 2011-04-12  7:25       ` KOSAKI Motohiro
  2011-05-24 20:07       ` Andrew Morton
  1 sibling, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2011-04-12  7:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton,
	Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki,
	Chris McDermott

> On Tue, 2011-04-12 at 10:01 +0900, KOSAKI Motohiro wrote:
> > > On Mon, 2011-04-11 at 17:19 +0900, KOSAKI Motohiro wrote:
> > > > This patch raises the zone_reclaim_mode threshold to 30. The value 30
> > > > has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> > > > such relatively cheap 2-4 socket machines are often used for
> > > > traditional servers like the one above. The intention is that these
> > > > machines don't use zone_reclaim_mode.
> > > 
> > > I know specifically of pieces of x86 hardware that set the information
> > > in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> > > behavior which that implies.
> > 
> > Which hardware?
> 
> I'd have to go digging for the model numbers.  I just remember having
> discussions with folks about it a couple of years ago.  My memory isn't
> what it used to be. :)

O.K.

> 
> > The reason we decided to change the default now is that the original bug
> > reporter was using plain commodity whitebox hardware and a very common
> > workload. If the hardware is common enough, we have to care about it; if
> > it is special, we don't. The hardware vendor should fix the firmware.
> 
> Yeah, it's certainly a "simple" fix.  The distance tables can certainly
> be adjusted easily, and worked around pretty trivially with boot
> options.  If we decide to change the generic case, let's also make sure
> that we put something else in place simultaneously that is nice for the
> folks that don't want it changed.  Maybe something DMI-based that digs
> for model numbers?

That makes good sense. If you can find the exact model number, I'll fully
assist with this part.


> I'll go try and dig for some more specifics on the hardware so we at
> least have something to test on.

Thank you!





* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-11  8:19 [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30 KOSAKI Motohiro
  2011-04-11 21:19 ` Andrew Morton
  2011-04-11 21:29 ` Dave Hansen
@ 2011-04-13  0:16 ` David Rientjes
  2011-04-13  0:26   ` Rob Mueller
  2 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2011-04-13  0:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter,
	KAMEZAWA Hiroyuki, Robert Mueller

On Mon, 11 Apr 2011, KOSAKI Motohiro wrote:

> Recently, Robert Mueller reported zone_reclaim_mode doesn't work
> properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
> He is using Cyrus IMAPd and it's built on a very traditional
> single-process model.
> 

Let's add Robert to the cc to see if this is still an issue, it hasn't 
been re-reported in over six months.

>   * a master process which reads config files and manages the other
>     processes
>   * multiple imapd processes, one per connection
>   * multiple pop3d processes, one per connection
>   * multiple lmtpd processes, one per connection
>   * periodic "cleanup" processes.
> 
> As a result, there are thousands of independent processes. The problem
> is that recent Intel motherboards turn on zone_reclaim_mode by default,
> and traditional prefork-model software doesn't work well with it.
> Unfortunately, this model is still typical even in the 21st century.
> We can't ignore it.
> 
> This patch raises the zone_reclaim_mode threshold to 30. The value 30
> has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> such relatively cheap 2-4 socket machines are often used for
> traditional servers like the one above. The intention is that these
> machines don't use zone_reclaim_mode.
> 
> Note: ia64 and Power have arch-specific RECLAIM_DISTANCE definitions,
> so this patch doesn't change the behavior of such high-end NUMA
> machines.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/topology.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
>   * (in whatever arch specific measurement units returned by node_distance())
>   * then switch on zone reclaim on boot.
>   */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30
>  #endif
>  #ifndef PENALTY_FOR_NODE_WITH_CPUS
>  #define PENALTY_FOR_NODE_WITH_CPUS	(1)

I ack'd this because we use it internally and it never got pushed 
upstream, but I'm curious why it isn't being done only in the x86 
topology.h file if we're concerned with specific commodity hardware and 
implicitly affecting all architectures other than ia64 and powerpc.

It would be even better to get rid of RECLAIM_DISTANCE entirely since it's 
fundamentally flawed without sanely configured SLITs per the ACPI spec, 
which specifies that these distances should be relative to the local 
distance of 10.  In this case, it would mean that the VM should prefer 
zone reclaim over remote node allocations when that memory takes 2x longer 
to access.  If your system doesn't have a SLIT, then remote nodes are 
assumed, possibly incorrectly, to have a latency 2x that of the local 
access.
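
The arithmetic behind the "2x" figure can be written down directly. The
sketch assumes, as stated above, that SLIT entries are relative to a local
distance of 10, and that a system without a SLIT is given a remote distance
of 20. The constant names follow the generic definitions in
include/linux/topology.h as far as I know; distance_percent_of_local() is a
hypothetical helper for illustration.

```c
#define LOCAL_DISTANCE  10	/* distance of a node to itself */
#define REMOTE_DISTANCE 20	/* assumed when no SLIT is provided */

/*
 * Express a node distance as a percentage of the local distance,
 * e.g. 20 -> 200 (2x local), 30 -> 300 (3x local).
 */
static int distance_percent_of_local(int distance)
{
	return distance * 100 / LOCAL_DISTANCE;
}
```

Under this reading, a threshold of 20 corresponds to "more than 2x local"
and a threshold of 30 to "more than 3x local", provided the firmware's
table is sane.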

We could probably do this if we measured the remote node memory access 
latency at boot and then define a threshold for turning zone_reclaim_mode 
on rather than relying on the distance at all.


* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-11 21:29 ` Dave Hansen
  2011-04-12  1:01   ` KOSAKI Motohiro
@ 2011-04-13  0:22   ` David Rientjes
  2011-04-13  0:49     ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: David Rientjes @ 2011-04-13  0:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	Christoph Lameter, KAMEZAWA Hiroyuki, Chris McDermott

On Mon, 11 Apr 2011, Dave Hansen wrote:

> > This patch raises the zone_reclaim_mode threshold to 30. The value 30
> > has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> > such relatively cheap 2-4 socket machines are often used for
> > traditional servers like the one above. The intention is that these
> > machines don't use zone_reclaim_mode.
> 
> I know specifically of pieces of x86 hardware that set the information
> in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> behavior which that implies.
> 

That doesn't seem like an argument against this patch, it's an improper 
configuration unless the remote memory access has a latency of 2.1x that 
of a local access between those two nodes.  If that's the case, then it's 
accurately following the ACPI spec and the VM has made its policy decision 
to enable zone_reclaim_mode as a result.  I'm surprised that they'd play 
with their BIOS to enable this by default, though, when it's an easily 
tunable sysctl.


* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-13  0:16 ` David Rientjes
@ 2011-04-13  0:26   ` Rob Mueller
  0 siblings, 0 replies; 15+ messages in thread
From: Rob Mueller @ 2011-04-13  0:26 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: LKML, linux-mm, Andrew Morton, Christoph Lameter, KAMEZAWA Hiroyuki


>> Recently, Robert Mueller reported zone_reclaim_mode doesn't work
>> properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB).
>> He is using Cyrus IMAPd and it's built on a very traditional
>> single-process model.
>>
>
> Let's add Robert to the cc to see if this is still an issue, it hasn't
> been re-reported in over six months.

We definitely still set this in /etc/sysctl.conf on every imap server 
machine:

vm.zone_reclaim_mode = 0

I believe it still defaults to 1 otherwise. What I haven't tested is if 
leaving it at 1 still causes problems. It definitely DID previously cause 
big problems (I think that was around 2.6.34 or so).

http://blog.fastmail.fm/2010/09/15/default-zone_reclaim_mode-1-on-numa-kernel-is-bad-for-fileemailweb-servers/

I'll try changing it to 1 on a machine for 4 hours, see if it makes a 
noticeable difference and report back.
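
For anyone wanting to check the setting from a program rather than by eye,
a minimal sketch follows. It assumes the value is exposed at
/proc/sys/vm/zone_reclaim_mode (the procfs counterpart of the
vm.zone_reclaim_mode sysctl); both helper names are hypothetical.

```c
#include <stdio.h>

/*
 * Parse the textual sysctl value (e.g. "0\n" or "1\n").
 * Returns the mode, or -1 if the text is not a non-negative integer.
 */
static int parse_zone_reclaim_mode(const char *text)
{
	int mode;

	if (sscanf(text, "%d", &mode) != 1 || mode < 0)
		return -1;
	return mode;
}

/*
 * Read the current mode from procfs. Returns -1 if the file cannot be
 * read, for example on a kernel built without NUMA support.
 */
static int read_zone_reclaim_mode(void)
{
	char buf[32];
	int mode = -1;
	FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		mode = parse_zone_reclaim_mode(buf);
	fclose(f);
	return mode;
}
```

A return value of 0 from read_zone_reclaim_mode() corresponds to the
"vm.zone_reclaim_mode = 0" line above.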

Rob



* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-13  0:22   ` David Rientjes
@ 2011-04-13  0:49     ` Dave Hansen
  2011-04-13  0:56       ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2011-04-13  0:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	Christoph Lameter, KAMEZAWA Hiroyuki, Chris McDermott

On Tue, 2011-04-12 at 17:22 -0700, David Rientjes wrote:
> On Mon, 11 Apr 2011, Dave Hansen wrote:
> > I know specifically of pieces of x86 hardware that set the information
> > in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> > behavior which that implies.
> 
> That doesn't seem like an argument against this patch, it's an improper 
> configuration unless the remote memory access has a latency of 2.1x that 
> of a local access between those two nodes.  If that's the case, then it's 
> accurately following the ACPI spec and the VM has made its policy decision 
> to enable zone_reclaim_mode as a result.

Heh, if the kernel broke on every system that didn't follow _some_ spec,
it wouldn't boot in very many places.

When you have a hammer, everything looks like a nail.  When you're a
BIOS developer, you start thwacking at the kernel with munged ACPI
tables instead of boot options.  Folks do this in the real world, and I
think if we can't put their names and addresses next to the code that
works around this, we might as well put the DMI strings of their
hardware. :) 

-- Dave



* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-13  0:49     ` Dave Hansen
@ 2011-04-13  0:56       ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2011-04-13  0:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton,
	Christoph Lameter, KAMEZAWA Hiroyuki, Chris McDermott

On Tue, 12 Apr 2011, Dave Hansen wrote:

> > That doesn't seem like an argument against this patch, it's an improper 
> > configuration unless the remote memory access has a latency of 2.1x that 
> > of a local access between those two nodes.  If that's the case, then it's 
> > accurately following the ACPI spec and the VM has made its policy decision 
> > to enable zone_reclaim_mode as a result.
> 
> Heh, if the kernel broke on every system that didn't follow _some_ spec,
> it wouldn't boot in very many places.
> 
> When you have a hammer, everything looks like a nail.  When you're a
> BIOS developer, you start thwacking at the kernel with munged ACPI
> tables instead of boot options.  Folks do this in the real world, and I
> think if we can't put their names and addresses next to the code that
> works around this, we might as well put the DMI strings of their
> hardware. :) 
> 

That's why I suggested doing away with RECLAIM_DISTANCE entirely, 
otherwise we are relying on the SLIT always being correct when we know 
it's not; the policy decision in the kernel as it stands now is that we 
want to enable zone_reclaim_mode when remote memory access takes longer 
than 2x that of a local access (3x with KOSAKI-san's patch), which is 
something we can actually measure at boot rather than relying on the BIOS 
at all.  Then we don't have to bother with DMI strings for specific pieces 
of hardware and can remove the existing ia64 and powerpc special cases.
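
A sketch of what such a measurement-based policy could look like, assuming
some (unspecified) mechanism has already measured local and remote access
latency at boot. The function name, the nanosecond units and the percentage
threshold are all assumptions made for illustration; nothing like this
exists in the patch under discussion.

```c
#define RECLAIM_OFF  0
#define RECLAIM_ZONE (1 << 0)

/* Hypothetical policy knob: enable when remote is more than 2x local. */
#define RECLAIM_LATENCY_PERCENT 200

/*
 * Decide the default zone_reclaim_mode from measured latencies instead
 * of the firmware-provided distance table.
 */
static int reclaim_mode_from_latency(unsigned long local_ns,
				     unsigned long remote_ns)
{
	if (local_ns == 0)
		return RECLAIM_OFF;	/* no usable measurement */

	/* remote/local > percent/100, written without division. */
	if (remote_ns * 100 > local_ns * RECLAIM_LATENCY_PERCENT)
		return RECLAIM_ZONE;

	return RECLAIM_OFF;
}
```

With this shape, the policy ("how much slower must remote memory be?")
lives in one constant, and the BIOS tables no longer influence it.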


* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-04-12  2:27     ` Dave Hansen
  2011-04-12  7:25       ` KOSAKI Motohiro
@ 2011-05-24 20:07       ` Andrew Morton
  2011-05-24 20:24         ` David Rientjes
  2011-05-24 20:37         ` Dave Hansen
  1 sibling, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2011-05-24 20:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KOSAKI Motohiro, LKML, linux-mm, Christoph Lameter,
	David Rientjes, KAMEZAWA Hiroyuki, Chris McDermott

On Mon, 11 Apr 2011 19:27:21 -0700
Dave Hansen <dave@linux.vnet.ibm.com> wrote:

> On Tue, 2011-04-12 at 10:01 +0900, KOSAKI Motohiro wrote:
> > > On Mon, 2011-04-11 at 17:19 +0900, KOSAKI Motohiro wrote:
> > > > This patch raises the zone_reclaim_mode threshold to 30. The value 30
> > > > has no specific meaning, but 20 means one-hop QPI/HyperTransport, and
> > > > such relatively cheap 2-4 socket machines are often used for
> > > > traditional servers like the one above. The intention is that these
> > > > machines don't use zone_reclaim_mode.
> > > 
> > > I know specifically of pieces of x86 hardware that set the information
> > > in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
> > > behavior which that implies.
> > 
> > Which hardware?
> 
> I'd have to go digging for the model numbers.  I just remember having
> discussions with folks about it a couple of years ago.  My memory isn't
> what it used to be. :)
> 
> > The reason we decided to change the default now is that the original bug
> > reporter was using plain commodity whitebox hardware and a very common
> > workload. If the hardware is common enough, we have to care about it; if
> > it is special, we don't. The hardware vendor should fix the firmware.
> 
> Yeah, it's certainly a "simple" fix.  The distance tables can certainly
> be adjusted easily, and worked around pretty trivially with boot
> options.  If we decide to change the generic case, let's also make sure
> that we put something else in place simultaneously that is nice for the
> folks that don't want it changed.  Maybe something DMI-based that digs
> for model numbers?
> 
> I'll go try and dig for some more specifics on the hardware so we at
> least have something to test on.
> 

How's that digging coming along?

I'm pretty wobbly about this patch.  Perhaps we should set
RECLAIM_DISTANCE to pi/2 or something, to force people to correctly set
the dang thing in initscripts.



* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-05-24 20:07       ` Andrew Morton
@ 2011-05-24 20:24         ` David Rientjes
  2011-05-24 20:37         ` Dave Hansen
  1 sibling, 0 replies; 15+ messages in thread
From: David Rientjes @ 2011-05-24 20:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, KOSAKI Motohiro, LKML, linux-mm, Christoph Lameter,
	KAMEZAWA Hiroyuki, Chris McDermott

On Tue, 24 May 2011, Andrew Morton wrote:

> How's that digging coming along?
> 
> I'm pretty wobbly about this patch.  Perhaps we should set
> RECLAIM_DISTANCE to pi/2 or something, to force people to correctly set
> the dang thing in initscripts.
> 

I think RECLAIM_DISTANCE as a constant is the wrong approach to begin 
with.

The distances between nodes as specified by the SLIT imply that a node with 
a distance of 30 is, in relative terms, 3x as far away as local memory.  
That's not the same as implying the latency is 3x greater, though, since 
the SLIT is based on relative distances according to ACPI 3.0.  In other 
words, it's perfectly legitimate for node 0 to have a distance of 20 and 
30 to nodes 1 and 2, respectively, if their memory access latencies are 5x 
and 10x greater, while the SLIT would remain unchanged if the latencies 
were 2x and 3x.

So basing zone reclaim by default off of a relative distance specified in 
the SLIT is wrong to begin with, and that's probably why we notice that 
the old value of 20 doesn't suffice on some machines anymore.
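
For reference, a minimal sketch of the kind of threshold check under 
discussion.  This is not the kernel's code: wants_zone_reclaim() and the 
flat SLIT array are hypothetical, and the strict greater-than comparison 
is an assumption.  It only illustrates why a BIOS that reports 21 trips a 
threshold of 20 but not one of 30.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not kernel code.
 *
 * slit is an nr_nodes x nr_nodes matrix stored row by row; entry
 * [local][node] is the SLIT distance from node `local` to `node`
 * (10 means local).  Returns true if any remote node is farther away
 * than reclaim_distance.
 *
 * Assumption: the comparison is strict, so a distance equal to the
 * threshold does not enable zone reclaim.
 */
static bool wants_zone_reclaim(const int *slit, size_t nr_nodes,
                               size_t local, int reclaim_distance)
{
    for (size_t node = 0; node < nr_nodes; node++) {
        if (node == local)
            continue;   /* skip the node's distance to itself */
        if (slit[local * nr_nodes + node] > reclaim_distance)
            return true;
    }
    return false;
}
```

With a two-node table of { 10, 21, 21, 10 }, the helper returns true for 
a threshold of 20 and false for a threshold of 30.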

As I suggested earlier, I think it would be far better to actually measure 
the memory access latency to remote nodes at boot to determine whether to 
prefer zone reclaim or not rather than basing it off a false SLIT 
assumption.
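
A rough sketch of what a latency-based decision could look like, assuming 
the local and remote latencies have already been measured somehow.  The 
function name, the nanosecond units and the ratio parameter are all 
hypothetical; how to take the measurement at boot is deliberately left 
out.

```c
#include <stdbool.h>

/*
 * Hypothetical policy helper, not kernel code.
 *
 * Prefer zone reclaim when a remote access takes longer than
 * ratio_threshold times a local access.  Both latencies are assumed to
 * be in nanoseconds and to come from some earlier measurement.
 */
static bool prefer_zone_reclaim(unsigned long local_ns,
                                unsigned long remote_ns,
                                unsigned int ratio_threshold)
{
    if (local_ns == 0)
        return false;   /* no usable measurement, leave it off */
    return remote_ns > (unsigned long)ratio_threshold * local_ns;
}
```

A ratio of 2 would correspond to the old threshold and a ratio of 3 to 
the one in KOSAKI-san's patch, but the decision would follow the measured 
latency rather than whatever the SLIT happens to say.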

Notice also that the machines that this patch was proposed for probably 
also didn't have a custom SLIT to begin with and so remote nodes get a 
default value of REMOTE_DISTANCE, which equaled RECLAIM_DISTANCE.  The 
same effect would have been achieved if you had decreased REMOTE_DISTANCE 
to 15.

We probably shouldn't be using SLIT distances at all within the kernel.


* Re: [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30
  2011-05-24 20:07       ` Andrew Morton
  2011-05-24 20:24         ` David Rientjes
@ 2011-05-24 20:37         ` Dave Hansen
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2011-05-24 20:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, LKML, linux-mm, Christoph Lameter,
	David Rientjes, KAMEZAWA Hiroyuki, Chris McDermott

On Tue, 2011-05-24 at 13:07 -0700, Andrew Morton wrote:
> On Mon, 11 Apr 2011 19:27:21 -0700
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> > I'll go try and dig for some more specifics on the hardware so we at
> > least have something to test on.
> 
> How's that digging coming along?
> 
> I'm pretty wobbly about this patch.  Perhaps we should set
> RECLAIM_DISTANCE to pi/2 or something, to force people to correctly set
> the dang thing in initscripts.

The original change in the hardware tables was for the benefit of a
benchmark.  Said benchmark isn't going to get run on mainline until the
next batch of enterprise distros drops, at which point the hardware
where this was done will be irrelevant for the benchmark.  I'm sure any
new hardware will just set this distance to yet another arbitrary value
to make the kernel do what it wants. :)

Also, when the hardware got _set_ to this initially, I complained.  So,
I guess I'm getting my way now, with this patch.  I'm cool with it:

Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>

-- Dave




Thread overview: 15+ messages
2011-04-11  8:19 [PATCH resend^2] mm: increase RECLAIM_DISTANCE to 30 KOSAKI Motohiro
2011-04-11 21:19 ` Andrew Morton
2011-04-12  0:59   ` KOSAKI Motohiro
2011-04-11 21:29 ` Dave Hansen
2011-04-12  1:01   ` KOSAKI Motohiro
2011-04-12  2:27     ` Dave Hansen
2011-04-12  7:25       ` KOSAKI Motohiro
2011-05-24 20:07       ` Andrew Morton
2011-05-24 20:24         ` David Rientjes
2011-05-24 20:37         ` Dave Hansen
2011-04-13  0:22   ` David Rientjes
2011-04-13  0:49     ` Dave Hansen
2011-04-13  0:56       ` David Rientjes
2011-04-13  0:16 ` David Rientjes
2011-04-13  0:26   ` Rob Mueller
