* zone state overhead
@ 2010-09-28  5:08 Shaohua Li
  2010-09-28 12:39 ` Christoph Lameter
  2010-10-08 15:29 ` Mel Gorman
  0 siblings, 2 replies; 65+ messages in thread
From: Shaohua Li @ 2010-09-28  5:08 UTC (permalink / raw)
  To: linux-mm; +Cc: cl

In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
according to perf when memory pressure is high. The workload does something
like:
for i in `seq 1 $nr_cpu`
do
        create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
        $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
done
This simply reads a sparse file on each CPU. Apparently
zone->percpu_drift_mark is too big, and I guess zone_page_state_snapshot()
causes a lot of cache bouncing on ->vm_stat_diff[]. Below is the zoneinfo for
reference.
Is there any way to reduce the overhead?

Node 3, zone   Normal
pages free     2055926
        min      1441
        low      1801
        high     2161
        scanned  0
        spanned  2097152
        present  2068480
  vm stats threshold: 98
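
For context, the expensive part is that an accurate count has to fold in
every CPU's pending delta. A rough userspace model of that walk (illustration
only, not the kernel code):

/*
 * Model of what zone_page_state_snapshot() has to do: the "accurate"
 * free-page count is the global counter plus every CPU's pending
 * per-cpu delta, so each call touches one cache line per online CPU.
 * Illustration only.
 */
#include <stdio.h>

#define NR_CPUS 64

static long vm_stat_free = 2055926;             /* global NR_FREE_PAGES */
static signed char vm_stat_diff[NR_CPUS];       /* per-cpu pending deltas */

static long snapshot_free_pages(void)
{
        long x = vm_stat_free;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)     /* reads every CPU's data */
                x += vm_stat_diff[cpu];

        return x < 0 ? 0 : x;
}

int main(void)
{
        vm_stat_diff[0] = -98;  /* one CPU holding a full threshold's worth */
        printf("snapshot: %ld pages\n", snapshot_free_pages());
        return 0;
}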


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28  5:08 zone state overhead Shaohua Li
@ 2010-09-28 12:39 ` Christoph Lameter
  2010-09-28 13:30   ` Mel Gorman
  2010-10-08 15:29 ` Mel Gorman
  1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-09-28 12:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, Mel Gorman

On Tue, 28 Sep 2010, Shaohua Li wrote:

> In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> according to perf when memory pressure is high. The workload does something
> like:
> for i in `seq 1 $nr_cpu`
> do
>         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
>         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> done
> this simply reads a sparse file for each CPU. Apparently the
> zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> Is there any way to reduce the overhead?

I guess Mel could reduce the percpu_drift_mark? Or tune that with a
reduction in the stat_threshold? The less the count can deviate, the lower
the percpu_drift_mark has to be and the fewer calls we need to
zone_page_state_snapshot().
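
To put rough numbers on that (quick sketch; the worst-case drift is taken to
be num_online_cpus() * threshold, using the zoneinfo Shaohua posted):

/*
 * How the worst-case NR_FREE_PAGES drift scales with the per-cpu stat
 * threshold. Numbers are from the zoneinfo above; illustration only.
 */
#include <stdio.h>

int main(void)
{
        int nr_cpus = 64;       /* Shaohua's 4-socket machine */
        int threshold = 98;     /* "vm stats threshold" from the zoneinfo */

        printf("worst-case drift:                 %d pages\n",
               nr_cpus * threshold);
        printf("worst-case drift, half threshold: %d pages\n",
               nr_cpus * (threshold / 2));
        return 0;
}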


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28 12:39 ` Christoph Lameter
@ 2010-09-28 13:30   ` Mel Gorman
  2010-09-28 13:40     ` Christoph Lameter
  2010-09-29  4:02     ` David Rientjes
  0 siblings, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2010-09-28 13:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Shaohua Li, linux-mm

On Tue, Sep 28, 2010 at 07:39:24AM -0500, Christoph Lameter wrote:
> On Tue, 28 Sep 2010, Shaohua Li wrote:
> 
> > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > according to perf when memory pressure is high. The workload does something
> > like:
> > for i in `seq 1 $nr_cpu`
> > do
> >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > done
> > this simply reads a sparse file for each CPU. Apparently the
> > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> > Is there any way to reduce the overhead?
> 

The overhead is higher than I would have expected. I would guess the
cache bounces are a real problem.

> I guess Mel could reduce the percpu_drift_mark? Or tune that with a
> reduction in the stat_threshold? The less the count can deviate the less
> the percpu_drift_mark has to be and the less we need calls to
> zone_page_state_snapshot.
> 

This is true. It's helpful to remember why this patch exists. Under heavy
memory pressure, large machines run the risk of live-locking because
NR_FREE_PAGES gets out of sync. The test case mentioned above is under
memory pressure so it is potentially at risk. Ordinarily, we would be less
concerned with performance under heavy memory pressure and more concerned
with correctness of behaviour. The percpu_drift_mark is set at a point where
the risk is "real". Lowering it will help performance but increase risk.
Reducing stat_threshold shifts the cost elsewhere by increasing the frequency
with which the vmstat counters are updated, which I considered worse overall.

Which of these is better or is there an alternative suggestion on how
this livelock can be avoided?

As a heads-up, I'm preparing for exams at the moment and, while I'm online,
I'm not in a position to prototype and test patches right now, but I can
review alternative proposals if people have them. I'm also out early next week.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28 13:30   ` Mel Gorman
@ 2010-09-28 13:40     ` Christoph Lameter
  2010-09-28 13:51       ` Mel Gorman
  2010-09-29  4:02     ` David Rientjes
  1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-09-28 13:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, linux-mm

On Tue, 28 Sep 2010, Mel Gorman wrote:

> Which of these is better or is there an alternative suggestion on how
> this livelock can be avoided?

We need to run some experiments to see what is worse. Let's start by
cutting both the stat threshold and the drift mark in half?

> As a heads up, I'm preparing for exams at the moment and while I'm online, I'm
> not in the position to prototype patches and test them at the moment but can
> review alternative proposals if people have them. I'm also out early next week.

Exams? You are finally graduating?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28 13:40     ` Christoph Lameter
@ 2010-09-28 13:51       ` Mel Gorman
  2010-09-28 14:08         ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-09-28 13:51 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Shaohua Li, linux-mm

On Tue, Sep 28, 2010 at 08:40:15AM -0500, Christoph Lameter wrote:
> On Tue, 28 Sep 2010, Mel Gorman wrote:
> 
> > Which of these is better or is there an alternative suggestion on how
> > this livelock can be avoided?
> 
> We need to run some experiments to see what is worse. Lets start by
> cutting both the stats threshold and the drift thing in half?
> 

Ok, I have no problem with that although, again, I'm really not in a position
to roll patches for it right now. I don't want to get side-tracked.

> > As a heads up, I'm preparing for exams at the moment and while I'm online, I'm
> > not in the position to prototype patches and test them at the moment but can
> > review alternative proposals if people have them. I'm also out early next week.
> 
> Exams? You are finally graduating?
> 

Depends on how they go :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28 13:51       ` Mel Gorman
@ 2010-09-28 14:08         ` Christoph Lameter
  2010-09-29  3:02           ` Shaohua Li
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-09-28 14:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, linux-mm


On Tue, 28 Sep 2010, Mel Gorman wrote:

> On Tue, Sep 28, 2010 at 08:40:15AM -0500, Christoph Lameter wrote:
> > On Tue, 28 Sep 2010, Mel Gorman wrote:
> >
> > > Which of these is better or is there an alternative suggestion on how
> > > this livelock can be avoided?
> >
> > We need to run some experiments to see what is worse. Lets start by
> > cutting both the stats threshold and the drift thing in half?
> >
>
> Ok, I have no problem with that although again, I'm really not in the position
> to roll patches for it right now. I don't want to get side-tracked.

Ok, the stat threshold determines percpu_drift_mark.

So changing the threshold should do the trick. Try this:

---
 mm/vmstat.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2010-09-28 09:04:48.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2010-09-28 09:05:16.000000000 -0500
@@ -118,7 +118,7 @@ static int calculate_threshold(struct zo

 	mem = zone->present_pages >> (27 - PAGE_SHIFT);

-	threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));
+	threshold = fls(num_online_cpus()) * (1 + fls(mem));

 	/*
 	 * Maximum threshold is 125
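
For reference, plugging Shaohua's zone into that formula gives (standalone
sketch with a local fls(); PAGE_SHIFT assumed to be 12, and the 125 cap
ignored since neither value reaches it):

/*
 * Worked example of the threshold change for Shaohua's Node 3 Normal
 * zone: present 2068480 pages, 64 online CPUs. Illustration only.
 */
#include <stdio.h>

static int fls(unsigned long x)
{
        int r = 0;

        while (x) {
                r++;
                x >>= 1;
        }
        return r;
}

int main(void)
{
        unsigned long mem = 2068480UL >> (27 - 12); /* present_pages >> (27 - PAGE_SHIFT) */
        int cpus = 64;

        printf("old threshold: %d\n", 2 * fls(cpus) * (1 + fls(mem)));  /* 98 */
        printf("new threshold: %d\n", fls(cpus) * (1 + fls(mem)));      /* 49 */
        return 0;
}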


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28 14:08         ` Christoph Lameter
@ 2010-09-29  3:02           ` Shaohua Li
  0 siblings, 0 replies; 65+ messages in thread
From: Shaohua Li @ 2010-09-29  3:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Mel Gorman, linux-mm

On Tue, 2010-09-28 at 22:08 +0800, Christoph Lameter wrote:
> On Tue, 28 Sep 2010, Mel Gorman wrote:
> 
> > On Tue, Sep 28, 2010 at 08:40:15AM -0500, Christoph Lameter wrote:
> > > On Tue, 28 Sep 2010, Mel Gorman wrote:
> > >
> > > > Which of these is better or is there an alternative suggestion on how
> > > > this livelock can be avoided?
> > >
> > > We need to run some experiments to see what is worse. Lets start by
> > > cutting both the stats threshold and the drift thing in half?
> > >
> >
> > Ok, I have no problem with that although again, I'm really not in the position
> > to roll patches for it right now. I don't want to get side-tracked.
> 
> Ok the stat threshold determines the per_cpu_drift_mark.
> 
> So changing the threshold should do the trick. Try this:
Doesn't work here; perf still shows the same overhead.

In the system:
Node 3, zone   Normal
pages free     2055926
        min      1441
        low      1801
        high     2161
        scanned  0
        spanned  2097152
        present  2068480
  vm stats threshold: 98
(low - min)/NR_CPUS = (1801 - 1441)/64 = 5,
so when the threshold is 5, there is no percpu_drift_mark.
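
In other words (quick sketch, assuming the drift mark is only needed when the
worst-case drift exceeds the low-min gap):

/*
 * Condition deciding whether percpu_drift_mark is needed at all, for
 * the zone above. Illustration only.
 */
#include <stdio.h>

int main(void)
{
        int min = 1441, low = 1801;
        int cpus = 64, threshold = 5;
        int tolerate_drift = low - min;         /* 360 pages */
        int max_drift = cpus * threshold;       /* 320 pages */

        if (max_drift > tolerate_drift)
                printf("percpu_drift_mark would be set\n");
        else
                printf("drift fits inside low-min, no drift mark needed\n");
        return 0;
}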


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28 13:30   ` Mel Gorman
  2010-09-28 13:40     ` Christoph Lameter
@ 2010-09-29  4:02     ` David Rientjes
  2010-09-29  4:47       ` Shaohua Li
  2010-09-29 10:03       ` Mel Gorman
  1 sibling, 2 replies; 65+ messages in thread
From: David Rientjes @ 2010-09-29  4:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, Shaohua Li, linux-mm

On Tue, 28 Sep 2010, Mel Gorman wrote:

> This is true. It's helpful to remember why this patch exists. Under heavy
> memory pressure, large machines run the risk of live-locking because the
> NR_FREE_PAGES gets out of sync. The test case mentioned above is under
> memory pressure so it is potentially at risk. Ordinarily, we would be less
> concerned with performance under heavy memory pressure and more concerned with
> correctness of behaviour. The percpu_drift_mark is set at a point where the
> risk is "real".  Lowering it will help performance but increase risk. Reducing
> stat_threshold shifts the cost elsewhere by increasing the frequency the
> vmstat counters are updated which I considered to be worse overall.
> 
> Which of these is better or is there an alternative suggestion on how
> this livelock can be avoided?
> 

I don't think the risk is quite real based on the calculation of 
percpu_drift_mark using the high watermark instead of the min watermark.  
For Shaohua's 64 cpu system:

Node 3, zone   Normal
pages free     2055926
        min      1441
        low      1801
        high     2161
        scanned  0
        spanned  2097152
        present  2068480
  vm stats threshold: 98

It's possible that we'll be 98 pages/cpu * 64 cpus = 6272 pages off in the 
NR_FREE_PAGES accounting at any given time.  So to avoid depleting memory 
reserves at the min watermark, which is livelock, and unnecessarily 
spending time doing reclaim, percpu_drift_mark should be
1801 + 6272 = 8073 pages.  Instead, we're currently using the high 
watermark, so percpu_drift_mark is 8433 pages.
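
Put another way (quick sketch using the numbers above; the drift mark is
taken to be watermark + max_drift as in the hunk below):

/*
 * The two candidate drift marks for the Node 3 Normal zone above.
 * Illustration only.
 */
#include <stdio.h>

int main(void)
{
        int low = 1801, high = 2161;
        int threshold = 98, cpus = 64;
        int max_drift = threshold * cpus;       /* 6272 pages */

        printf("based on low:  %d pages\n", low + max_drift);  /* 8073 */
        printf("based on high: %d pages\n", high + max_drift); /* 8433 */
        return 0;
}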

It's plausible that we never reclaim enough memory to get above the high
watermark, since we only trigger reclaim when we can't allocate above low,
so we may be stuck calling zone_page_state_snapshot() constantly.

I'd be interested to see if this patch helps.
---
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -154,7 +154,7 @@ static void refresh_zone_stat_thresholds(void)
 		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
 		max_drift = num_online_cpus() * threshold;
 		if (max_drift > tolerate_drift)
-			zone->percpu_drift_mark = high_wmark_pages(zone) +
+			zone->percpu_drift_mark = low_wmark_pages(zone) +
 					max_drift;
 	}
 }


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29  4:02     ` David Rientjes
@ 2010-09-29  4:47       ` Shaohua Li
  2010-09-29  5:06         ` David Rientjes
  2010-09-29 10:03       ` Mel Gorman
  1 sibling, 1 reply; 65+ messages in thread
From: Shaohua Li @ 2010-09-29  4:47 UTC (permalink / raw)
  To: David Rientjes; +Cc: Mel Gorman, Christoph Lameter, linux-mm

On Wed, 2010-09-29 at 12:02 +0800, David Rientjes wrote:
> On Tue, 28 Sep 2010, Mel Gorman wrote:
> 
> > This is true. It's helpful to remember why this patch exists. Under heavy
> > memory pressure, large machines run the risk of live-locking because the
> > NR_FREE_PAGES gets out of sync. The test case mentioned above is under
> > memory pressure so it is potentially at risk. Ordinarily, we would be less
> > concerned with performance under heavy memory pressure and more concerned with
> > correctness of behaviour. The percpu_drift_mark is set at a point where the
> > risk is "real".  Lowering it will help performance but increase risk. Reducing
> > stat_threshold shifts the cost elsewhere by increasing the frequency the
> > vmstat counters are updated which I considered to be worse overall.
> > 
> > Which of these is better or is there an alternative suggestion on how
> > this livelock can be avoided?
> > 
> 
> I don't think the risk is quite real based on the calculation of 
> percpu_drift_mark using the high watermark instead of the min watermark.  
> For Shaohua's 64 cpu system:
> 
> Node 3, zone   Normal
> pages free     2055926
>         min      1441
>         low      1801
>         high     2161
>         scanned  0
>         spanned  2097152
>         present  2068480
>   vm stats threshold: 98
> 
> It's possible that we'll be 98 pages/cpu * 64 cpus = 6272 pages off in the 
> NR_FREE_PAGES accounting at any given time.  So to avoid depleting memory 
> reserves at the min watermark, which is livelock, and unnecessarily 
> spending time doing reclaim, percpu_drift_mark should be
> 1801 + 6272 = 8073 pages.  Instead, we're currently using the high 
> watermark, so percpu_drift_mark is 8433 pages.
> 
> It's plausible that we never reclaim sufficient memory that we ever get 
> above the high watermark since we only trigger reclaim when we can't 
> allocate above low, so we may be stuck calling zone_page_state_snapshot() 
> constantly.
> 
> I'd be interested to see if this patch helps.
> ---
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -154,7 +154,7 @@ static void refresh_zone_stat_thresholds(void)
>  		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
>  		max_drift = num_online_cpus() * threshold;
>  		if (max_drift > tolerate_drift)
> -			zone->percpu_drift_mark = high_wmark_pages(zone) +
> +			zone->percpu_drift_mark = low_wmark_pages(zone) +
>  					max_drift;
>  	}
>  }
I'm afraid not. I tried Christoph's patch, which doesn't help.
In that patch, the threshold = 6272/2 = 3136, and the percpu_drift_mark
is 3136 + 2161 < 8073.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29  4:47       ` Shaohua Li
@ 2010-09-29  5:06         ` David Rientjes
  0 siblings, 0 replies; 65+ messages in thread
From: David Rientjes @ 2010-09-29  5:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Mel Gorman, Christoph Lameter, linux-mm

On Wed, 29 Sep 2010, Shaohua Li wrote:

> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -154,7 +154,7 @@ static void refresh_zone_stat_thresholds(void)
> >  		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
> >  		max_drift = num_online_cpus() * threshold;
> >  		if (max_drift > tolerate_drift)
> > -			zone->percpu_drift_mark = high_wmark_pages(zone) +
> > +			zone->percpu_drift_mark = low_wmark_pages(zone) +
> >  					max_drift;
> >  	}
> >  }
> I'm afraid not. I tried Christoph's patch, which doesn't help.
> in that patch, the threshold = 6272/2 = 3136. and the percpu_drift_mark
> is 3136 + 2161 < 8073
> 

I think my patch is conceptually correct based on the real risk that Mel 
was describing.  With Christoph's patch the threshold would have been 49 
(max_drift is 3136), which would also increase the cacheline bounce 
for zone_page_state_add(), but the penalty of zone_page_state_snapshot() 
is probably much higher.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29  4:02     ` David Rientjes
  2010-09-29  4:47       ` Shaohua Li
@ 2010-09-29 10:03       ` Mel Gorman
  2010-09-29 14:12         ` Christoph Lameter
  2010-09-29 19:44         ` David Rientjes
  1 sibling, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2010-09-29 10:03 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, Shaohua Li, linux-mm

On Tue, Sep 28, 2010 at 09:02:59PM -0700, David Rientjes wrote:
> On Tue, 28 Sep 2010, Mel Gorman wrote:
> 
> > This is true. It's helpful to remember why this patch exists. Under heavy
> > memory pressure, large machines run the risk of live-locking because the
> > NR_FREE_PAGES gets out of sync. The test case mentioned above is under
> > memory pressure so it is potentially at risk. Ordinarily, we would be less
> > concerned with performance under heavy memory pressure and more concerned with
> > correctness of behaviour. The percpu_drift_mark is set at a point where the
> > risk is "real".  Lowering it will help performance but increase risk. Reducing
> > stat_threshold shifts the cost elsewhere by increasing the frequency the
> > vmstat counters are updated which I considered to be worse overall.
> > 
> > Which of these is better or is there an alternative suggestion on how
> > this livelock can be avoided?
> > 
> 
> I don't think the risk is quite real based on the calculation of 
> percpu_drift_mark using the high watermark instead of the min watermark.  
> For Shaohua's 64 cpu system:
> 
> Node 3, zone   Normal
> pages free     2055926
>         min      1441
>         low      1801
>         high     2161
>         scanned  0
>         spanned  2097152
>         present  2068480
>   vm stats threshold: 98
> 
> It's possible that we'll be 98 pages/cpu * 64 cpus = 6272 pages off in the 
> NR_FREE_PAGES accounting at any given time. 

Right.

> So to avoid depleting memory 
> reserves at the min watermark, which is livelock, and unnecessarily 
> spending time doing reclaim, percpu_drift_mark should be
> 1801 + 6272 = 8073 pages.  Instead, we're currently using the high 
> watermark, so percpu_drift_mark is 8433 pages.
> 

The point of calculating from the high watermark was to prevent kswapd
going to sleep prematurely, but if it can be shown that the problem goes away
using just the low watermark, I'd go with it. I'm skeptical, though, for
reasons I outline below.

> It's plausible that we never reclaim sufficient memory that we ever get 
> above the high watermark since we only trigger reclaim when we can't 
> allocate above low, so we may be stuck calling zone_page_state_snapshot() 
> constantly.
> 

Except that zone_page_state_snapshot() is only called while kswapd is
awake, which is the proxy indicator of pressure. Just being below
percpu_drift_mark is not enough to call zone_page_state_snapshot().
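
Roughly, the decision being described is (userspace model of the logic, not
the kernel helper itself):

/*
 * Model of the decision above: the expensive per-cpu walk is only
 * taken when the cheap counter is below percpu_drift_mark *and*
 * kswapd is awake. Illustration only.
 */
#include <stdio.h>
#include <stdbool.h>

static long free_pages_estimate(long fast_count, long drift_mark,
                                bool kswapd_awake, long accurate_count)
{
        if (fast_count < drift_mark && kswapd_awake)
                return accurate_count;  /* pay for the per-cpu walk */
        return fast_count;              /* cheap global counter */
}

int main(void)
{
        /* Node 3 Normal zone: drift mark 8433; kswapd asleep vs awake */
        printf("asleep: %ld\n", free_pages_estimate(5000, 8433, false, 5200));
        printf("awake:  %ld\n", free_pages_estimate(5000, 8433, true, 5200));
        return 0;
}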

> I'd be interested to see if this patch helps.
> ---
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -154,7 +154,7 @@ static void refresh_zone_stat_thresholds(void)
>  		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
>  		max_drift = num_online_cpus() * threshold;
>  		if (max_drift > tolerate_drift)
> -			zone->percpu_drift_mark = high_wmark_pages(zone) +
> +			zone->percpu_drift_mark = low_wmark_pages(zone) +
>  					max_drift;
>  	}

Well, this in itself would not fix one problem you highlight: kswapd does
not reclaim enough to keep a zone above percpu_drift_mark, meaning that
the instant it wakes, zone_page_state_snapshot() is in use and stays in
use while kswapd is awake. These are the marks of interest at the moment:

min		1441
low		1801
high		2161
driftdanger	8433

kswapd can be mostly awake, keeping ahead of the allocators by
maintaining a free level somewhere between low and high while
zone_page_state_snapshot() is continually in use.

Maybe when percpu_drift_mark is set because the machine is large, the
watermarks need to change so that high = percpu_drift_mark + low? That
would make the marks:

min		1441
low		1801
driftdanger	8073
high		9874

That would improve the situation slightly by widening the window between
kswapd going to sleep and waking up due to memory pressure while also having
a window where kswapd is awake but zone_page_state_snapshot() is not in
use. It doesn't help if the pressure is enough to keep kswapd awake and at
a level between low and driftdanger.

Alternatively, we could revisit Christoph's suggestion of modifying
stat_threshold when under pressure instead of using zone_page_state_snapshot().
Maybe by temporarily reducing stat_threshold while kswapd is awake to a
per-zone value such that

zone->low + threshold*nr_online_cpus < high

?
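
For the zone above, that inequality pins the threshold down pretty hard
(quick sketch):

/*
 * What per-cpu threshold would satisfy
 *      low + threshold * nr_online_cpus < high
 * for the Node 3 Normal zone above. Illustration only.
 */
#include <stdio.h>

int main(void)
{
        int low = 1801, high = 2161, cpus = 64;

        /* largest integer threshold with low + threshold * cpus < high */
        printf("threshold would have to drop to %d pages/cpu\n",
               (high - low - 1) / cpus);
        return 0;
}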

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 10:03       ` Mel Gorman
@ 2010-09-29 14:12         ` Christoph Lameter
  2010-09-29 14:17           ` Mel Gorman
  2010-09-29 19:44         ` David Rientjes
  1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-09-29 14:12 UTC (permalink / raw)
  To: Mel Gorman; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, 29 Sep 2010, Mel Gorman wrote:

> Alternatively we could revisit Christoph's suggestion of modifying
> stat_threshold when under pressure instead of zone_page_state_snapshot. Maybe
> by temporarily stat_threshold when kswapd is awake to a per-zone value
> such that
>
> zone->low + threshold*nr_online_cpus < high

Updating the threshold is also expensive. I was thinking more along the lines
of reducing the threshold for good if the VM runs into reclaim trouble
because of too much fuzziness in the counters.




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 14:12         ` Christoph Lameter
@ 2010-09-29 14:17           ` Mel Gorman
  2010-09-29 14:34             ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-09-29 14:17 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, Sep 29, 2010 at 09:12:25AM -0500, Christoph Lameter wrote:
> On Wed, 29 Sep 2010, Mel Gorman wrote:
> 
> > Alternatively we could revisit Christoph's suggestion of modifying
> > stat_threshold when under pressure instead of zone_page_state_snapshot. Maybe
> > by temporarily stat_threshold when kswapd is awake to a per-zone value
> > such that
> >
> > zone->low + threshold*nr_online_cpus < high
> 
> Updating the threshold also is expensive.

Even if it's moved to a read-mostly part of the zone such as after
lowmem_reserve?

> I thought more along the lines
> of reducing the threshold for good if the VM runs into reclaim trouble
> because of too high fuzziness in the counters.
> 

That would be unfortunate as it would only take trouble to happen once
for performance to be impaired for the remaining uptime of the machine.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 14:17           ` Mel Gorman
@ 2010-09-29 14:34             ` Christoph Lameter
  2010-09-29 14:41               ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-09-29 14:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, 29 Sep 2010, Mel Gorman wrote:

> > Updating the threshold also is expensive.
>
> Even if it's moved to a read-mostly part of the zone such as after
> lowmem_reserve?

The threshold is stored in the hot part of the per cpu page structure.

> > I thought more along the lines
> > of reducing the threshold for good if the VM runs into reclaim trouble
> > because of too high fuzziness in the counters.
> >
>
> That would be unfortunate as it would only take trouble to happen once
> for performance to be impaired for the remaining uptime of the machine.

Reclaim also impairs performance, and inaccurate counters may cause
unnecessary reclaim. Ultimately this is a tradeoff. The current thresholds
were calculated so that there would be zero impact even for very large
configurations where all processors continually page fault. I think we have
some leeway to go lower there. The tuning situation was a bit extreme.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 14:34             ` Christoph Lameter
@ 2010-09-29 14:41               ` Mel Gorman
  2010-09-29 14:45                 ` Mel Gorman
  2010-09-29 14:52                 ` Christoph Lameter
  0 siblings, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2010-09-29 14:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, Sep 29, 2010 at 09:34:09AM -0500, Christoph Lameter wrote:
> On Wed, 29 Sep 2010, Mel Gorman wrote:
> 
> > > Updating the threshold also is expensive.
> >
> > Even if it's moved to a read-mostly part of the zone such as after
> > lowmem_reserve?
> 
> The threshold is stored in the hot part of the per cpu page structure.
> 

And the consequences of moving it? In terms of moving, it would probably
work out better to move percpu_drift_mark after the lowmem_reserve and
put the threshold after it so they're at least similarly hot across
CPUs.

> > > I thought more along the lines
> > > of reducing the threshold for good if the VM runs into reclaim trouble
> > > because of too high fuzziness in the counters.
> > >
> >
> > That would be unfortunate as it would only take trouble to happen once
> > for performance to be impaired for the remaining uptime of the machine.
> 
> Reclaim also impairs performance and inaccurate counters may cause
> unnecessary reclaim.

Ah, it's limited, to be fair. You might end up reclaiming a "maximum drift"
number of pages you didn't need to, but that doesn't seem as bad.

> Ultimately this is a tradeoff. The current thresholds
> were calculated so that there will be zero impact even for very large
> configurations where all processors continual page fault. I think we have
> some leeway to go lower there. The tuning situation was a bit extreme.
> 

Ok.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 14:41               ` Mel Gorman
@ 2010-09-29 14:45                 ` Mel Gorman
  2010-09-29 14:54                   ` Christoph Lameter
  2010-09-29 14:52                 ` Christoph Lameter
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-09-29 14:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, Sep 29, 2010 at 03:41:59PM +0100, Mel Gorman wrote:
> On Wed, Sep 29, 2010 at 09:34:09AM -0500, Christoph Lameter wrote:
> > On Wed, 29 Sep 2010, Mel Gorman wrote:
> > 
> > > > Updating the threshold also is expensive.
> > >
> > > Even if it's moved to a read-mostly part of the zone such as after
> > > lowmem_reserve?
> > 
> > The threshold is stored in the hot part of the per cpu page structure.
> > 
> 
> And the consequences of moving it? In terms of moving, it would probably
> work out better to move percpu_drift_mark after the lowmem_reserve and
> put the threshold after it so they're at least similarly hot across
> CPUs.
> 

I should be clearer here. Initially, I'm thinking the consequences of moving
it are not terribly bad, so I'm wondering if you see some problem I have not
thought of. If the threshold value is sharing the cache line with watermark
or lowmem_reserve, then it should still have the same hotness in the path
we really care about (zone_watermark_ok for example) without necessarily
needing to be part of the per-cpu structure. The real badness would be if an
additional cache line was required due to the move but I don't think this is
the case (but I didn't double check with pahole or the back of an envelope
either). The line will be dirtied and cause a bounce when kswapd wakes or
goes to sleep but this should not be a severe problem.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 14:41               ` Mel Gorman
  2010-09-29 14:45                 ` Mel Gorman
@ 2010-09-29 14:52                 ` Christoph Lameter
  1 sibling, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2010-09-29 14:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, 29 Sep 2010, Mel Gorman wrote:

> > The threshold is stored in the hot part of the per cpu page structure.
> >
>
> And the consequences of moving it? In terms of moving, it would probably
> work out better to move percpu_drift_mark after the lowmem_reserve and
> put the threshold after it so they're at least similarly hot across
> CPUs.

If you move it then the cache footprint of the vm stat functions (which
need to access the threshold on every counter update!) will increase and
performance will sink dramatically. I tried to avoid placing the threshold
there when I developed that approach, but it always caused a dramatic
regression under heavy load.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 14:45                 ` Mel Gorman
@ 2010-09-29 14:54                   ` Christoph Lameter
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2010-09-29 14:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: David Rientjes, Shaohua Li, linux-mm

On Wed, 29 Sep 2010, Mel Gorman wrote:

> I should be clearer here. Initially, I'm thinking the consequences of moving
> it are not terrible bad so I'm wondering if you see some problem I have not
> thought of. If the threshold value is sharing the cache line with watermark
> or lowmem_reserve, then it should still have the same hotness in the path
> we really care about (zone_watermark_ok for example) without necessarily
> needing to be part of the per-cpu structure. The real badness would be if an
> additional cache line was required due to the move but I don't think this is
> the case (but I didn't double check with pahole or the back of an envelope
> either). The line will be dirtied and cause a bounce when kswapd wakes or
> goes to sleep but this should not be a severe problem.

The critical paths for vm statistics are __inc_zone_state() and
__dec_zone_state(). Those are sprinkled all over.
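
As a rough userspace model of that hot path (illustration only, not the real
__inc_zone_state()):

/*
 * Every counter increment compares the per-cpu delta against the
 * per-cpu stat threshold, which is why the threshold has to stay on
 * the hot per-cpu cache line. Illustration only.
 */
#include <stdio.h>

struct pcp_model {
        signed char vm_stat_diff;       /* pending delta for one item */
        signed char stat_threshold;     /* read on every update */
};

static long global_count = 2055926;

static void inc_state(struct pcp_model *pcp)
{
        if (++pcp->vm_stat_diff > pcp->stat_threshold) {
                global_count += pcp->vm_stat_diff;      /* fold into global counter */
                pcp->vm_stat_diff = 0;
        }
}

int main(void)
{
        struct pcp_model pcp = { .vm_stat_diff = 0, .stat_threshold = 98 };
        int i;

        for (i = 0; i < 200; i++)
                inc_state(&pcp);
        printf("global: %ld, pending: %d\n", global_count, pcp.vm_stat_diff);
        return 0;
}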


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-29 10:03       ` Mel Gorman
  2010-09-29 14:12         ` Christoph Lameter
@ 2010-09-29 19:44         ` David Rientjes
  1 sibling, 0 replies; 65+ messages in thread
From: David Rientjes @ 2010-09-29 19:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Lameter, Shaohua Li, linux-mm

On Wed, 29 Sep 2010, Mel Gorman wrote:

> > It's plausible that we never reclaim sufficient memory that we ever get 
> > above the high watermark since we only trigger reclaim when we can't 
> > allocate above low, so we may be stuck calling zone_page_state_snapshot() 
> > constantly.
> > 
> 
> Except that zone_page_state_snapshot() is only called while kswapd is
> awake which is the proxy indicator of pressure. Just being below
> percpu_drift_mark is not enough to call zone_page_state_snapshot.
> 

Right, so zone_page_state_snapshot() is always called to check the min
watermark for the subsequent allocation immediately after kswapd is kicked
in the slow path, meaning it is called for every allocation when the zone
is between low and min.  That's a 360-page window for Shaohua's system, and
even more for GFP_ATOMIC.  kswapd will reclaim to the high watermark, 360
pages above low, using zone_page_state_snapshot() the whole time as well.  So
under heavy memory pressure, it seems like the majority of
zone_watermark_ok() calls are using zone_page_state_snapshot() anyway.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-09-28  5:08 zone state overhead Shaohua Li
  2010-09-28 12:39 ` Christoph Lameter
@ 2010-10-08 15:29 ` Mel Gorman
  2010-10-09  0:58   ` Shaohua Li
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-08 15:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, cl

On Tue, Sep 28, 2010 at 01:08:01PM +0800, Shaohua Li wrote:
> In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> according to perf when memory pressure is high. The workload does something
> like:
> for i in `seq 1 $nr_cpu`
> do
>         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
>         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> done
> this simply reads a sparse file for each CPU. Apparently the
> zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.

Would it be possible for you to post the oprofile report? I'm in the
early stages of trying to reproduce this locally based on your test
description. The first machine I tried showed that zone_nr_free_pages
was consuming 0.26% of profile time, with the vast bulk occupied by
do_mpage_readpage. See as follows:

1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
.....
7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages

The machine I am testing on is a non-NUMA, 4-core, single-socket box with
totally different characteristics, but I want to be sure I'm going more or
less in the right direction with the reproduction case before trying to find
a larger machine.

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-08 15:29 ` Mel Gorman
@ 2010-10-09  0:58   ` Shaohua Li
  2010-10-11  8:56     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Shaohua Li @ 2010-10-09  0:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, cl

On Fri, Oct 08, 2010 at 11:29:53PM +0800, Mel Gorman wrote:
> On Tue, Sep 28, 2010 at 01:08:01PM +0800, Shaohua Li wrote:
> > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > according to perf when memory pressure is high. The workload does something
> > like:
> > for i in `seq 1 $nr_cpu`
> > do
> >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > done
> > this simply reads a sparse file for each CPU. Apparently the
> > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> 
> Would it be possible for you to post the oprofile report? I'm in the
> early stages of trying to reproduce this locally based on your test
> description. The first machine I tried showed that zone_nr_page_state
> was consuming 0.26% of profile time with the vast bulk occupied by
> do_mpage_readahead. See as follows
> 
> 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
> 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> .....
> 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> 
> The machine I am testing on is non-NUMA 4-core single socket and totally
> different characteristics but I want to be sure I'm going more or less the
> right direction with the reproduction case before trying to find a larger
> machine.
Here it is. This is a 4-socket Nehalem machine.
           268160.00 57.2% _raw_spin_lock                      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
            40302.00  8.6% zone_nr_free_pages                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
            36827.00  7.9% do_mpage_readpage                   /lib/modules/2.6.36-rc5-shli+/build/vmlinux
            28011.00  6.0% _raw_spin_lock_irq                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
            22973.00  4.9% flush_tlb_others_ipi                /lib/modules/2.6.36-rc5-shli+/build/vmlinux
            10713.00  2.3% smp_invalidate_interrupt            /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             7342.00  1.6% find_next_bit                       /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             4571.00  1.0% try_to_unmap_one                    /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             4094.00  0.9% default_send_IPI_mask_sequence_phys /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             3497.00  0.7% get_page_from_freelist              /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             3032.00  0.6% _raw_spin_lock_irqsave              /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             3029.00  0.6% shrink_page_list                    /lib/modules/2.6.36-rc5-shli+/build/vmlinux
             2318.00  0.5% __inc_zone_state                    /lib/modules/2.6.36-rc5-shli+/build/vmlinux
 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-09  0:58   ` Shaohua Li
@ 2010-10-11  8:56     ` Mel Gorman
  2010-10-12  1:05       ` Shaohua Li
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-11  8:56 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, cl

On Sat, Oct 09, 2010 at 08:58:07AM +0800, Shaohua Li wrote:
> On Fri, Oct 08, 2010 at 11:29:53PM +0800, Mel Gorman wrote:
> > On Tue, Sep 28, 2010 at 01:08:01PM +0800, Shaohua Li wrote:
> > > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > > according to perf when memory pressure is high. The workload does something
> > > like:
> > > for i in `seq 1 $nr_cpu`
> > > do
> > >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > done
> > > this simply reads a sparse file for each CPU. Apparently the
> > > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> > 
> > Would it be possible for you to post the oprofile report? I'm in the
> > early stages of trying to reproduce this locally based on your test
> > description. The first machine I tried showed that zone_nr_page_state
> > was consuming 0.26% of profile time with the vast bulk occupied by
> > do_mpage_readahead. See as follows
> > 
> > 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> > 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> > 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> > 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> > 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> > 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
> > 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> > 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> > 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> > 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> > 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> > 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> > .....
> > 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> > 
> > The machine I am testing on is non-NUMA 4-core single socket and totally
> > different characteristics but I want to be sure I'm going more or less the
> > right direction with the reproduction case before trying to find a larger
> > machine.
> Here it is. this is a 4 socket nahalem machine.
>            268160.00 57.2% _raw_spin_lock                      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
>             40302.00  8.6% zone_nr_free_pages                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
>             36827.00  7.9% do_mpage_readpage                   /lib/modules/2.6.36-rc5-shli+/build/vmlinux
>             28011.00  6.0% _raw_spin_lock_irq                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
>             22973.00  4.9% flush_tlb_others_ipi                /lib/modules/2.6.36-rc5-shli+/build/vmlinux
>             10713.00  2.3% smp_invalidate_interrupt            /lib/modules/2.6.36-rc5-shli+/build/vmlinux

Ok, we are seeing *very* different things. Can you tell me more about
what usemem actually does? I thought it might be doing something like
mapping the file and just reading it, but that doesn't appear to be the
case. I also tried using madvise() to drop pages to strictly limit how
much memory was used, but the profiles are still different.

I've posted the very basic test script I was using based on your
description. Can you tell me what usemem does differently or, better again,
post the source of usemem? Can you also post your .config, please? I'm curious
to see why you are seeing so much more locking overhead. If you have lock
debugging and lock stat enabled, would it be possible to test without them
enabled to see what the profile looks like?

Thanks

==== CUT HERE ====
#!/bin/bash
# Benchmark is just a basic memory pressure. Based on a test suggested by
# Shaohua Li in a bug report complaining about the overhead of measuring
# NR_FREE_PAGES under memory pressure
#
# Copyright Mel Gorman 2010
NUM_CPU=$(grep -c '^processor' /proc/cpuinfo)
MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{print $2}'`
DURATION=300
SELF=$0

if [ "$1" != "" ]; then
	DURATION=$1
fi

function create_sparse_file() {
	TITLE=$1
	SIZE=$2

	echo Creating sparse file $TITLE
	dd if=/dev/zero of=$TITLE bs=4096 count=0 seek=$((SIZE/4096+1))
}

echo -n > memorypressure-sparsefile-$$.pids
if [ "$SHELLPACK_TEMP" != "" ]; then
	cd $SHELLPACK_TEMP || die Failed to cd to temporary directory
fi

# Extract and build usemem program
LINECOUNT=`wc -l $SELF | awk '{print $1}'`
CSTART=`grep -n "BEGIN C FILE" $SELF | tail -1 | awk -F : '{print $1}'`
tail -$(($LINECOUNT-$CSTART)) $SELF > usemem.c
gcc -m64 -O2 usemem.c -o usemem || exit -1

for i in `seq 1 $NUM_CPU`
do
        create_sparse_file sparse-$i $((10 * MEMTOTAL_BYTES / NUM_CPU))
        ./usemem sparse-$i $DURATION $((10 * MEMTOTAL_BYTES / NUM_CPU)) &
	echo $! >> memorypressure-sparsefile-$$.pids
done

# Wait for memory pressure programs to exit
EXITCODE=0
echo Waiting on helper programs to exit
for PID in `cat memorypressure-sparsefile-$$.pids`; do
	wait $PID
	RET=$?
	if [ $RET -ne 0 ]; then
		EXITCODE=$RET
	fi
done

rm memorypressure-sparsefile-$$.pids
rm sparse-*
exit $EXITCODE

==== BEGIN C FILE ====
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <time.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

void usage(void)
{
	fprintf(stderr, "usemem filename duration nr_bytes\n");
	exit(EXIT_FAILURE);
}

static int do_test(char *filename, unsigned long duration, unsigned long nr_bytes)
{
	unsigned long pagesize = getpagesize();
	int references = nr_bytes / pagesize;
	int fd;
	int read_index, discard_index;
	time_t endtime;
	void **fifo;
	char *file;
	int dummy = 0;

	fifo = calloc(references, sizeof(void *));
	if (fifo == NULL) {
		fprintf(stderr, "Failed to malloc pointers\n");
		exit(EXIT_FAILURE);
	}

	fd = open(filename, O_RDONLY);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}

	file = mmap(NULL, nr_bytes, PROT_READ, MAP_SHARED, fd, 0);
	if (file == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}

	endtime = time(NULL) + duration;
	read_index = 0;
	discard_index = 1;
	do {
		dummy += file[read_index * pagesize];
		/* WRITE TEST file[read_index * pagesize]++; */
		read_index = (read_index + 13) % references;
		discard_index = (discard_index + 13) % references;

		/* Could use this to strictly limit memory usage
		fifo[read_index] = &(file[read_index * pagesize]);
		if (fifo[discard_index]) {
			madvise(file + (discard_index * pagesize), pagesize, MADV_DONTNEED);
		}
		*/
	} while (time(NULL) < endtime);

	/* Depend on dummy to prevent gcc optimising the reads away */
	return dummy == -1 ? EXIT_FAILURE : EXIT_SUCCESS;
}

int main(int argc, char **argv)
{
	char *filename;
	unsigned long duration, nr_bytes;

	if (argc != 4)
		usage();
	filename = argv[1];
	duration = strtoul(argv[2], NULL, 0);
	nr_bytes = strtoul(argv[3], NULL, 0);
	setbuf(stdout, NULL);
	setbuf(stderr, NULL);

	return do_test(filename, duration, nr_bytes);
}


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-11  8:56     ` Mel Gorman
@ 2010-10-12  1:05       ` Shaohua Li
  2010-10-12 16:25         ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Shaohua Li @ 2010-10-12  1:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, cl

[-- Attachment #1: Type: text/plain, Size: 4194 bytes --]

On Mon, Oct 11, 2010 at 04:56:48PM +0800, Mel Gorman wrote:
> On Sat, Oct 09, 2010 at 08:58:07AM +0800, Shaohua Li wrote:
> > On Fri, Oct 08, 2010 at 11:29:53PM +0800, Mel Gorman wrote:
> > > On Tue, Sep 28, 2010 at 01:08:01PM +0800, Shaohua Li wrote:
> > > > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > > > according to perf when memory pressure is high. The workload does something
> > > > like:
> > > > for i in `seq 1 $nr_cpu`
> > > > do
> > > >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > > >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > > done
> > > > this simply reads a sparse file for each CPU. Apparently the
> > > > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > > > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> > > 
> > > Would it be possible for you to post the oprofile report? I'm in the
> > > early stages of trying to reproduce this locally based on your test
> > > description. The first machine I tried showed that zone_nr_page_state
> > > was consuming 0.26% of profile time with the vast bulk occupied by
> > > do_mpage_readahead. See as follows
> > > 
> > > 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> > > 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> > > 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> > > 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> > > 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> > > 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
> > > 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> > > 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> > > 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> > > 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> > > 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> > > 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> > > .....
> > > 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> > > 
> > > The machine I am testing on is non-NUMA 4-core single socket and totally
> > > different characteristics but I want to be sure I'm going more or less the
> > > right direction with the reproduction case before trying to find a larger
> > > machine.
> > Here it is. this is a 4 socket nahalem machine.
> >            268160.00 57.2% _raw_spin_lock                      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> >             40302.00  8.6% zone_nr_free_pages                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> >             36827.00  7.9% do_mpage_readpage                   /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> >             28011.00  6.0% _raw_spin_lock_irq                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> >             22973.00  4.9% flush_tlb_others_ipi                /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> >             10713.00  2.3% smp_invalidate_interrupt            /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> 
> Ok, we are seeing *very* different things. Can you tell me more about
> what usemem actually does? I thought it might be doing something like
> mapping the file and just reading it but that doesn't appear to be the
> case. I also tried using madvise dropping pages to strictly limit how
> much memory was used but the profiles are still different.
> 
> I've posted the very basic test script I was using based on your
> description. Can you tell me what usemem does differently or better again,
> post the source of usemem? Can you also post your .config please. I'm curious
> to see why you are seeing so much more locking overhead. If you have lock
> debugging and lock stat enabled, would it be possible to test without them
> enabled to see what the profile looks like?
Basically the same test. I'm using Fengguang's test, please check the attached
file. I didn't enable lock stat or lock debugging. The difference is that my test
runs on a 4 socket system. On a 1 socket system, I don't see the issue either.
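
For what it's worth, a minimal sketch of the access pattern that command line
suggests - only a guess for illustration, not the attached tool itself: mmap
the sparse file read-only and touch one byte every 4096 bytes, one copy per CPU.

/*
 * Illustration only - not the attached usemem tool. Mimics the
 * "-j 4096 --readonly" style access: map a sparse file read-only
 * and touch one byte per 4096-byte step, generating a stream of
 * page cache allocations on whichever CPU runs it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned char *p;
	unsigned long sum = 0;
	size_t len, off;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <sparse-file> <bytes>\n", argv[0]);
		return 1;
	}

	len = strtoul(argv[2], NULL, 0);
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	for (off = 0; off < len; off += 4096)	/* step by 4096 bytes */
		sum += p[off];			/* read-only touch */

	printf("%lu\n", sum);
	munmap(p, len);
	close(fd);
	return 0;
}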

Thanks,
Shaohua

[-- Attachment #2: test.tgz --]
[-- Type: application/x-gtar, Size: 19817 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-12  1:05       ` Shaohua Li
@ 2010-10-12 16:25         ` Mel Gorman
  2010-10-13  2:41           ` Shaohua Li
  2010-10-13  3:36           ` KOSAKI Motohiro
  0 siblings, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-12 16:25 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, cl, Andrew Morton, David Rientjes

> > > > > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > > > > according to perf when memory pressure is high. The workload does something
> > > > > like:
> > > > > for i in `seq 1 $nr_cpu`
> > > > > do
> > > > >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > > > >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > > > done
> > > > > this simply reads a sparse file for each CPU. Apparently the
> > > > > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > > > > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> > > > 
> > > > Would it be possible for you to post the oprofile report? I'm in the
> > > > early stages of trying to reproduce this locally based on your test
> > > > description. The first machine I tried showed that zone_nr_page_state
> > > > was consuming 0.26% of profile time with the vast bulk occupied by
> > > > do_mpage_readahead. See as follows
> > > > 
> > > > 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> > > > 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> > > > 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> > > > 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> > > > 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> > > > 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
> > > > 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> > > > 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> > > > 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> > > > 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> > > > 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> > > > 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> > > > .....
> > > > 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> > > > 
> > > > The machine I am testing on is non-NUMA 4-core single socket and totally
> > > > different characteristics but I want to be sure I'm going more or less the
> > > > right direction with the reproduction case before trying to find a larger
> > > > machine.
> > > 
> > > Here it is. this is a 4 socket nahalem machine.
> > >            268160.00 57.2% _raw_spin_lock                      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > >             40302.00  8.6% zone_nr_free_pages                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > >             36827.00  7.9% do_mpage_readpage                   /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > >             28011.00  6.0% _raw_spin_lock_irq                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > >             22973.00  4.9% flush_tlb_others_ipi                /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > >             10713.00  2.3% smp_invalidate_interrupt            /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > 
> > <SNIP>
> >
> Basically the similar test. I'm using Fengguang's test, please check attached
> file. I didn't enable lock stat or debug. The difference is my test is under a
> 4 socket system. In a 1 socket system, I don't see the issue too.
> 

Ok, finding a large enough machine was key here, true enough. I don't
have access to Nehalem boxes but the same problem showed up on a large
ppc64 machine (8 socket, interestingly enough a 3 socket did not have any
significant problem).  Based on that, I reproduced the problem and came up
with the patch below.

Christoph, can you look at this please? I know you had concerns about adjusting
thresholds being an expensive operation, but the patch limits how often it
occurs and it seems better than reducing thresholds for the full lifetime of
the system just to avoid counter drift. What I did find with the patch is that
the overhead of __mod_zone_page_state() increases because of the temporarily
reduced threshold. It goes from 0.0403% of profile time to 0.0967% on one
machine and from 0.0677% to 0.43% on another. As this is only while kswapd
is awake, it seems within an acceptable margin, but it is a caution against
simply reducing the existing thresholds. What is more relevant is that the time
to complete the benchmark increases due to the reduction of the thresholds.
This is a tradeoff between being fast and being safe, but I'm open to
suggestions on how high a safe threshold might be.
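
To put a rough number on "safe": assuming the pressure threshold below works
out to (high_wmark - low_wmark) / num_online_cpus, a minimal sketch of the
resulting drift bound is the following, so a stale zone_page_state() reading
cannot overstate free memory by much more than one watermark gap while kswapd
is awake.

/*
 * Sketch only, mirroring the intent of calculate_pressure_threshold()
 * below: with threshold = (high - low) / nr_cpus, the NR_FREE_PAGES
 * drift summed over all CPUs stays within about one watermark gap
 * (plus nr_cpus pages when the gap is smaller than nr_cpus, because
 * of the max(1, ...) clamp).
 */
static unsigned long worst_case_pressure_drift(unsigned long high_wmark,
					       unsigned long low_wmark,
					       int nr_cpus)
{
	unsigned long threshold = (high_wmark - low_wmark) / nr_cpus;

	if (!threshold)
		threshold = 1;

	return threshold * nr_cpus;
}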

Shaohua, can you test keeping an eye out for any additional function
that is now taking a lot more CPU time?

==== CUT HERE ====
mm: page allocator: Adjust the per-cpu counter threshold when memory is low

Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
memory is low] noted that watermarks were based on the vmstat
NR_FREE_PAGES. To avoid synchronization overhead, these counters are
maintained on a per-cpu basis and drained both periodically and when the
per-cpu delta exceeds a threshold. On large CPU systems, the difference
between the estimate and real value of NR_FREE_PAGES can be very high.
The system can get into a case where pages are allocated far below the
min watermark potentially causing livelock issues. The commit solved the
problem by taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can consume
a large amount of CPU time on systems with many sockets due to cache
line bouncing. This patch takes a different approach. For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift
to what should be a safe level. This incurs a performance penalty under heavy
memory pressure by a factor that depends on the workload and the machine, but
the machine should function correctly without accidentally exhausting all
memory on a node. There is an additional cost when kswapd wakes and sleeps,
but the event is not expected to be frequent - in Shaohua's test case, at
least, there was one recorded sleep and wake event.

Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    6 ------
 include/linux/vmstat.h |    2 ++
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |    4 ++--
 mm/vmscan.c            |    2 ++
 mm/vmstat.c            |   42 +++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 47 insertions(+), 30 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..343fd5c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..c67d333 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
+void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8cfa9c..a9b4542 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
+	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2436,7 +2436,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..47ba29e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2378,7 +2378,9 @@ static int kswapd(void *p)
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+					enable_pgdat_percpu_threshold(pgdat);
 					schedule();
+					disable_pgdat_percpu_threshold(pgdat);
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 355a9e6..19bd4a1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone)) /
+				num_online_cpus()));
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +165,40 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void disable_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+
+	for_each_populated_zone(zone) {
+		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
+void enable_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+
+	for_each_populated_zone(zone) {
+		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +866,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-12 16:25         ` Mel Gorman
@ 2010-10-13  2:41           ` Shaohua Li
  2010-10-13 12:09             ` Mel Gorman
  2010-10-13  3:36           ` KOSAKI Motohiro
  1 sibling, 1 reply; 65+ messages in thread
From: Shaohua Li @ 2010-10-13  2:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, cl, Andrew Morton, David Rientjes

On Wed, Oct 13, 2010 at 12:25:26AM +0800, Mel Gorman wrote:
> > > > > > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > > > > > according to perf when memory pressure is high. The workload does something
> > > > > > like:
> > > > > > for i in `seq 1 $nr_cpu`
> > > > > > do
> > > > > >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > > > > >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > > > > done
> > > > > > this simply reads a sparse file for each CPU. Apparently the
> > > > > > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > > > > > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> > > > > 
> > > > > Would it be possible for you to post the oprofile report? I'm in the
> > > > > early stages of trying to reproduce this locally based on your test
> > > > > description. The first machine I tried showed that zone_nr_page_state
> > > > > was consuming 0.26% of profile time with the vast bulk occupied by
> > > > > do_mpage_readahead. See as follows
> > > > > 
> > > > > 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> > > > > 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> > > > > 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> > > > > 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> > > > > 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> > > > > 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
> > > > > 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> > > > > 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> > > > > 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> > > > > 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> > > > > 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> > > > > 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> > > > > .....
> > > > > 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> > > > > 
> > > > > The machine I am testing on is non-NUMA 4-core single socket and totally
> > > > > different characteristics but I want to be sure I'm going more or less the
> > > > > right direction with the reproduction case before trying to find a larger
> > > > > machine.
> > > > 
> > > > Here it is. this is a 4 socket nahalem machine.
> > > >            268160.00 57.2% _raw_spin_lock                      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > >             40302.00  8.6% zone_nr_free_pages                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > >             36827.00  7.9% do_mpage_readpage                   /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > >             28011.00  6.0% _raw_spin_lock_irq                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > >             22973.00  4.9% flush_tlb_others_ipi                /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > >             10713.00  2.3% smp_invalidate_interrupt            /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > 
> > > <SNIP>
> > >
> > Basically the similar test. I'm using Fengguang's test, please check attached
> > file. I didn't enable lock stat or debug. The difference is my test is under a
> > 4 socket system. In a 1 socket system, I don't see the issue too.
> > 
> 
> Ok, finding a large enough machine was key here true enough. I don't
> have access to Nehalem boxes but the same problem showed up on a large
> ppc64 machine (8 socket, interestingly enough a 3 socket did not have any
> significant problem).  Based on that, I reproduced the problem and came up
> with the patch below.
> 
> Christoph, can you look at this please? I know you had concerns about adjusting
> thresholds as being an expensive operation but the patch limits how often it
> occurs and it seems better than reducing thresholds for the full lifetime of
> the system just to avoid counter drift. What I did find with the patch that
> the overhead of __mod_zone_page_state() increases because of the temporarily
> reduced threshold. It goes from 0.0403% of profile time to 0.0967% on one
> machine and from 0.0677% to 0.43% on another. As this is just while kswapd
> is awake, it seems withiin an acceptable margin but it is a caution against
> simply reducing the existing thresholds. What is more relevant is the time
> to complete the benchmark is increased due to the reduction of the thresholds.
> This is a tradeoff between being fast and safe but I'm open to
> suggestions on how high a safe threshold might be.
> 
> Shaohua, can you test keeping an eye out for any additional function
> that is now taking a lot more CPU time?
Seems ok so far on the 4 socket system. In this system, each node has 8G of
memory, so the threshold is 5 under memory pressure. I haven't tested this
on smaller machines yet, for example where each node has just 4G of memory.
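
That matches a back-of-the-envelope check against the zoneinfo posted earlier
in the thread (high 2161 pages, low 1801 pages, 64 CPUs), assuming the pressure
threshold is (high_wmark - low_wmark) / num_online_cpus as in the patch:

#include <stdio.h>

/*
 * Illustration only: recompute the reduced threshold from the posted
 * zoneinfo (high = 2161 pages, low = 1801 pages) and 64 online CPUs.
 */
static int pressure_threshold(long high, long low, int nr_cpus)
{
	long t = (high - low) / nr_cpus;

	return t > 1 ? (int)t : 1;
}

int main(void)
{
	printf("%d\n", pressure_threshold(2161, 1801, 64));	/* prints 5 */
	return 0;
}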

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-12 16:25         ` Mel Gorman
  2010-10-13  2:41           ` Shaohua Li
@ 2010-10-13  3:36           ` KOSAKI Motohiro
  2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
  2010-10-13 11:24             ` zone state overhead Mel Gorman
  1 sibling, 2 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-13  3:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

Hello, 

> ==== CUT HERE ====
> mm: page allocator: Adjust the per-cpu counter threshold when memory is low
> 
> Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
> memory is low] noted that watermarks were based on the vmstat
> NR_FREE_PAGES. To avoid synchronization overhead, these counters are
> maintained on a per-cpu basis and drained both periodically and when a
> threshold is above a threshold. On large CPU systems, the difference
> between the estimate and real value of NR_FREE_PAGES can be very high.
> The system can get into a case where pages are allocated far below the
> min watermark potentially causing livelock issues. The commit solved the
> problem by taking a better reading of NR_FREE_PAGES when memory was low.
> 
> Unfortately, as reported by Shaohua Li this accurate reading can consume
> a large amount of CPU time on systems with many sockets due to cache
> line bouncing. This patch takes a different approach. For large machines
> where counter drift might be unsafe and while kswapd is awake, the per-cpu
> thresholds for the target pgdat are reduced to limit the level of drift
> to what should be a safe level. This incurs a performance penalty in heavy
> memory pressure by a factor that depends on the workload and the machine but
> the machine should function correctly without accidentally exhausting all
> memory on a node. There is an additional cost when kswapd wakes and sleeps
> but the event is not expected to be frequent - in Shaohua's test case,
> there was one recorded sleep and wake event at least.

Interesting. I've reviewed this one.


> 
> Reported-by: Shaohua Li <shaohua.li@intel.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/mmzone.h |    6 ------
>  include/linux/vmstat.h |    2 ++
>  mm/mmzone.c            |   21 ---------------------
>  mm/page_alloc.c        |    4 ++--
>  mm/vmscan.c            |    2 ++
>  mm/vmstat.c            |   42 +++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 47 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3984c4e..343fd5c 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
>  	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
>  }
>  
> -#ifdef CONFIG_SMP
> -unsigned long zone_nr_free_pages(struct zone *zone);
> -#else
> -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
> -#endif /* CONFIG_SMP */
> -
>  /*
>   * The "priority" of VM scanning is how much of the queues we will scan in one
>   * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index eaaea37..c67d333 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
>  extern void __dec_zone_state(struct zone *, enum zone_stat_item);
>  
>  void refresh_cpu_vm_stats(int);
> +void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
> +void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
>  #else /* CONFIG_SMP */
>  
>  /*
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index e35bfb8..f5b7d17 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
>  	return 1;
>  }
>  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> -
> -#ifdef CONFIG_SMP
> -/* Called when a more accurate view of NR_FREE_PAGES is needed */
> -unsigned long zone_nr_free_pages(struct zone *zone)
> -{
> -	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> -
> -	/*
> -	 * While kswapd is awake, it is considered the zone is under some
> -	 * memory pressure. Under pressure, there is a risk that
> -	 * per-cpu-counter-drift will allow the min watermark to be breached
> -	 * potentially causing a live-lock. While kswapd is awake and
> -	 * free pages are low, get a better estimate for free pages
> -	 */
> -	if (nr_free_pages < zone->percpu_drift_mark &&
> -			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> -		return zone_page_state_snapshot(zone, NR_FREE_PAGES);

Now, we can remove zone_page_state_snapshot() too.


> -
> -	return nr_free_pages;
> -}
> -#endif /* CONFIG_SMP */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a8cfa9c..a9b4542 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  {
>  	/* free_pages my go negative - that's OK */
>  	long min = mark;
> -	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
> +	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
>  	int o;
>  
>  	if (alloc_flags & ALLOC_HIGH)
> @@ -2436,7 +2436,7 @@ void show_free_areas(void)
>  			" all_unreclaimable? %s"
>  			"\n",
>  			zone->name,
> -			K(zone_nr_free_pages(zone)),
> +			K(zone_page_state(zone, NR_FREE_PAGES)),
>  			K(min_wmark_pages(zone)),
>  			K(low_wmark_pages(zone)),
>  			K(high_wmark_pages(zone)),
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c5dfabf..47ba29e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
>  				 */
>  				if (!sleeping_prematurely(pgdat, order, remaining)) {
>  					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> +					enable_pgdat_percpu_threshold(pgdat);
>  					schedule();
> +					disable_pgdat_percpu_threshold(pgdat);

If we have 4096 cpus, max drift = 125 x 4096 x 4096 ~= 2GB. It is higher than the zone watermark.
Then, such a system can exhaust memory before kswapd calls disable_pgdat_percpu_threshold().

Hmmm....
This seems like a fundamental problem. Currently our zone watermark and per-cpu stat threshold
have completely unbalanced definitions.

zone watermark:          very few (a few megabytes)
                                       proportional to sqrt(mem)
                                       not proportional to nr-cpus

per-cpu stat threshold:  relatively large (desktop: a few megabytes, server ~50MB, SGI 2GB ;-)
                                       proportional to log(mem)
                                       proportional to log(nr-cpus)

It means that a large number of cpus breaks the watermark assumption.....
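
For reference, a sketch of where the ~2GB figure above comes from, assuming
the worst case where every online CPU holds a vm_stat_diff[NR_FREE_PAGES]
delta just below the stat threshold:

/*
 * Illustration only: worst-case NR_FREE_PAGES drift if every online
 * CPU's per-cpu delta sits just below the stat threshold.
 */
static unsigned long max_drift_bytes(int stat_threshold, int nr_cpus,
				     unsigned long page_size)
{
	return (unsigned long)stat_threshold * nr_cpus * page_size;
}

/* 125 * 4096 * 4096 = 2,097,152,000 bytes ~= 2GB */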



>  				} else {
>  					if (remaining)
>  						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 355a9e6..19bd4a1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
>  
>  #ifdef CONFIG_SMP
>  
> +static int calculate_pressure_threshold(struct zone *zone)
> +{
> +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
> +				num_online_cpus())));
> +}
> +
>  static int calculate_threshold(struct zone *zone)
>  {
>  	int threshold;
> @@ -159,6 +165,40 @@ static void refresh_zone_stat_thresholds(void)
>  	}
>  }
>  
> +void disable_pgdat_percpu_threshold(pg_data_t *pgdat)
> +{
> +	struct zone *zone;
> +	int cpu;
> +	int threshold;
> +
> +	for_each_populated_zone(zone) {
> +		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
> +			continue;

We don't use the for_each_populated_zone() plus "zone->zone_pgdat != pgdat"
combination in most other places.

        for (i = 0; i < pgdat->nr_zones; i++) {

Is that enough?

> +
> +		threshold = calculate_pressure_threshold(zone);
> +		for_each_online_cpu(cpu)
> +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> +							= threshold;
>
> +	}
> +}
> +
> +void enable_pgdat_percpu_threshold(pg_data_t *pgdat)
> +{
> +	struct zone *zone;
> +	int cpu;
> +	int threshold;
> +
> +	for_each_populated_zone(zone) {
> +		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
> +			continue;
> +
> +		threshold = calculate_threshold(zone);
> +		for_each_online_cpu(cpu)
> +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> +							= threshold;
> +	}
> +}

disable_pgdat_percpu_threshold() and enable_pgdat_percpu_threshold() are
almost the same. Can you merge them?


> +
>  /*
>   * For use when we know that interrupts are disabled.
>   */
> @@ -826,7 +866,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n        scanned  %lu"
>  		   "\n        spanned  %lu"
>  		   "\n        present  %lu",
> -		   zone_nr_free_pages(zone),
> +		   zone_page_state(zone, NR_FREE_PAGES),
>  		   min_wmark_pages(zone),
>  		   low_wmark_pages(zone),
>  		   high_wmark_pages(zone),
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot()
  2010-10-13  3:36           ` KOSAKI Motohiro
@ 2010-10-13  6:25             ` KOSAKI Motohiro
  2010-10-13  6:27               ` [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur KOSAKI Motohiro
                                 ` (3 more replies)
  2010-10-13 11:24             ` zone state overhead Mel Gorman
  1 sibling, 4 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-13  6:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

> > @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
> >  				 */
> >  				if (!sleeping_prematurely(pgdat, order, remaining)) {
> >  					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +					enable_pgdat_percpu_threshold(pgdat);
> >  					schedule();
> > +					disable_pgdat_percpu_threshold(pgdat);
> 
> If we have 4096 cpus, max drift = 125x4096x4096 ~= 2GB. It is higher than zone watermark.
> Then, such sysmtem can makes memory exshost before kswap call disable_pgdat_percpu_threshold().
> 
> Hmmm....
> This seems fundamental problem. current our zone watermark and per-cpu stat threshold have completely
> unbalanced definition.
> 
> zone watermak:             very few (few mega bytes)
>                                        propotional sqrt(mem)
>                                        no propotional nr-cpus
> 
> per-cpu stat threshold:  relatively large (desktop: few mega bytes, server ~50MB, SGI 2GB ;-)
>                                        propotional log(mem)
>                                        propotional log(nr-cpus)
> 
> It mean, much cpus break watermark assumption.....

I've tried to implement a different patch.
The key idea is that max drift is a very small value if the system doesn't
have more than 1024 CPUs.

Three model cases:

    Case1: typical desktop
      CPU: 2
      MEM: 2GB
      max-drift = 2 x log2(2) x log2(2x1024/128) x 2 = 40 pages

    Case2: relatively large server
      CPU: 64
      MEM: 8GBx4 (=32GB)
      max-drift = 2 x log2(64) x log2(8x1024/128) x 64 = 6272 pages = 24.5MB

    Case3: ultimately big server
      CPU: 2048
      MEM: 64GBx256 (=16TB)
      max-drift = 125 x 2048 = 256000 pages = 1000MB
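
The numbers above can be reproduced, roughly, from the existing threshold
heuristic. A sketch, assuming calculate_threshold() is approximately
2 * fls(nr_cpus) * (1 + fls(zone size in 128MB units)) capped at 125, and
that worst-case drift is threshold * nr_cpus pages; the exact values come
out slightly different from the hand calculations above but land in the
same ballpark:

/* Illustration only; fls_approx() mimics fls(): position of highest set bit. */
static int fls_approx(unsigned long x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

static int approx_stat_threshold(int nr_cpus, unsigned long zone_mb)
{
	int t = 2 * fls_approx(nr_cpus) * (1 + fls_approx(zone_mb / 128));

	return t > 125 ? 125 : t;
}

static unsigned long approx_max_drift_pages(int nr_cpus, unsigned long zone_mb)
{
	return (unsigned long)approx_stat_threshold(nr_cpus, zone_mb) * nr_cpus;
}

/*
 * approx_max_drift_pages(2, 2048)          ~= 48 pages
 * approx_max_drift_pages(64, 8 * 1024)     ~= 7168 pages (~28MB)
 * approx_max_drift_pages(2048, 64 * 1024)   = 125 * 2048 = 256000 pages (1000MB)
 */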


So, I think we can accept wasting 20MB of memory for good performance, but can't
accept wasting 1000MB ;-)
Fortunately, >1000 CPU machines are nowadays always used for HPC, so the
zone_page_state_snapshot() overhead isn't such a big deal there.

So, my idea is:

Case1 and Case2: reserve max-drift pages up front.
Case3:           keep using zone_page_state_snapshot()




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur
  2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
@ 2010-10-13  6:27               ` KOSAKI Motohiro
  2010-10-13  6:39                 ` KAMEZAWA Hiroyuki
  2010-10-13 12:59                 ` Mel Gorman
  2010-10-13  6:28               ` [RFC][PATCH 2/3] mm: update pcp->stat_threshold " KOSAKI Motohiro
                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-13  6:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

Currently, memory hotplug calls setup_per_zone_wmarks() and
calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve().

This means the number of reserved pages isn't updated when memory hotplug
occurs. This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mm.h  |    3 +--
 mm/memory_hotplug.c |    9 +++++----
 mm/page_alloc.c     |    6 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6417c21..4607ec7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1203,8 +1203,7 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
-extern void setup_per_zone_wmarks(void);
-extern void calculate_zone_inactive_ratio(struct zone *zone);
+extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 extern void show_mem(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d8375bb..27d580d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -437,8 +437,9 @@ int online_pages(unsigned long pfn, unsigned long nr_pages)
 		zone_pcp_update(zone);
 
 	mutex_unlock(&zonelists_mutex);
-	setup_per_zone_wmarks();
-	calculate_zone_inactive_ratio(zone);
+
+	init_per_zone_wmark_min();
+
 	if (onlined_pages) {
 		kswapd_run(zone_to_nid(zone));
 		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
@@ -872,8 +873,8 @@ repeat:
 	zone->zone_pgdat->node_present_pages -= offlined_pages;
 	totalram_pages -= offlined_pages;
 
-	setup_per_zone_wmarks();
-	calculate_zone_inactive_ratio(zone);
+	init_per_zone_wmark_min();
+
 	if (!node_present_pages(node)) {
 		node_clear_state(node, N_HIGH_MEMORY);
 		kswapd_stop(node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01fa206..6846096 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4880,7 +4880,7 @@ static void setup_per_zone_lowmem_reserve(void)
  * Ensures that the watermark[min,low,high] values for each zone are set
  * correctly with respect to min_free_kbytes.
  */
-void setup_per_zone_wmarks(void)
+static void setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4956,7 +4956,7 @@ void setup_per_zone_wmarks(void)
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-void calculate_zone_inactive_ratio(struct zone *zone)
+static void calculate_zone_inactive_ratio(struct zone *zone)
 {
 	unsigned int gb, ratio;
 
@@ -5002,7 +5002,7 @@ static void __init setup_per_zone_inactive_ratio(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
-static int __init init_per_zone_wmark_min(void)
+int __meminit init_per_zone_wmark_min(void)
 {
 	unsigned long lowmem_kbytes;
 
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC][PATCH 2/3] mm: update pcp->stat_threshold when memory hotplug occur
  2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
  2010-10-13  6:27               ` [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur KOSAKI Motohiro
@ 2010-10-13  6:28               ` KOSAKI Motohiro
  2010-10-13  6:40                 ` KAMEZAWA Hiroyuki
  2010-10-13 13:02                 ` Mel Gorman
  2010-10-13  6:32               ` [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
  2010-10-13  7:10               ` [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed KOSAKI Motohiro
  3 siblings, 2 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-13  6:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

Currently, cpu hotplug updates pcp->stat_threshold, but memory
hotplug doesn't. There is no reason for the difference.

Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/vmstat.h |    5 ++++-
 mm/page_alloc.c        |    3 +++
 mm/vmstat.c            |    5 ++---
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..1997988 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void refresh_zone_stat_thresholds(void);
 #else /* CONFIG_SMP */
 
 /*
@@ -299,6 +300,8 @@ static inline void __dec_zone_page_state(struct page *page,
 #define mod_zone_page_state __mod_zone_page_state
 
 static inline void refresh_cpu_vm_stats(int cpu) { }
-#endif
+static inline void refresh_zone_stat_thresholds(void) { }
+
+#endif /* CONFIG_SMP */
 
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6846096..53627fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -51,6 +51,7 @@
 #include <linux/kmemleak.h>
 #include <linux/memory.h>
 #include <linux/compaction.h>
+#include <linux/vmstat.h>
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 
@@ -5013,6 +5014,8 @@ int __meminit init_per_zone_wmark_min(void)
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
+
+	refresh_zone_stat_thresholds();
 	setup_per_zone_wmarks();
 	setup_per_zone_lowmem_reserve();
 	setup_per_zone_inactive_ratio();
diff --git a/mm/vmstat.c b/mm/vmstat.c
index baa4ab3..48b0463 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -132,7 +132,7 @@ static int calculate_threshold(struct zone *zone)
 /*
  * Refresh the thresholds for each zone.
  */
-static void refresh_zone_stat_thresholds(void)
+void refresh_zone_stat_thresholds(void)
 {
 	struct zone *zone;
 	int cpu;
@@ -370,7 +370,7 @@ void refresh_cpu_vm_stats(int cpu)
 			atomic_long_add(global_diff[i], &vm_stat[i]);
 }
 
-#endif
+#endif /* CONFIG_SMP */
 
 #ifdef CONFIG_NUMA
 /*
@@ -1057,7 +1057,6 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_SMP
 	int cpu;
 
-	refresh_zone_stat_thresholds();
 	register_cpu_notifier(&vmstat_notifier);
 
 	for_each_online_cpu(cpu)
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot()
  2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
  2010-10-13  6:27               ` [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur KOSAKI Motohiro
  2010-10-13  6:28               ` [RFC][PATCH 2/3] mm: update pcp->stat_threshold " KOSAKI Motohiro
@ 2010-10-13  6:32               ` KOSAKI Motohiro
  2010-10-13 13:19                 ` Mel Gorman
  2010-10-13  7:10               ` [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed KOSAKI Motohiro
  3 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-13  6:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

Shaohua Li reported that commit aa45484031 (mm: page allocator: calculate a
better estimate of NR_FREE_PAGES when memory is low and kswapd is awake)
caused a performance regression.

| In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10%
| cpu time
| according to perf when memory pressure is high. The workload does
| something like:
| for i in `seq 1 $nr_cpu`
| do
|        create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
|        $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
| done
| this simply reads a sparse file for each CPU. Apparently the
| zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot()
| makes a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for
| reference.
| Is there any way to reduce the overhead?
|
| Node 3, zone   Normal
| pages free     2055926
|         min      1441
|         low      1801
|         high     2161
|         scanned  0
|         spanned  2097152
|         present  2068480
|   vm stats threshold: 98

This means zone_page_state_snapshot() is more costly than we expected. This
patch introduces a very different approach: we reserve max-drift pages
up front instead of calculating free pages at runtime.

However, this technique can't be used on systems with many cpus and little
memory. On such systems, we still need to use zone_page_state_snapshot().

Example1: typical desktop
  CPU: 2
  MEM: 2GB

  old) zone->min = sqrt(2x1024x1024x16) = 5792 KB = 1448 pages
  new) max-drift = 2 x log2(2) x log2(2x1024/128) x 2 = 40
       zone->min = 1448 + 40 = 1488 pages

Example2: relatively large server
  CPU: 64
  MEM: 8GBx4 (=32GB)

  old) zone->min = sqrt(32x1024x1024x16)/4 = 5792 KB = 1448 pages
  new) max-drift = 2 x log2(64) x log2(8x1024/128) x 64 = 6272 pages
       zone->min = 1448 + 6272 = 7720 pages

  Hmm, zone->min became almost 5x larger. Is it acceptable? I think yes.
  Today, we can buy 8GB of DRAM for $20. So, wasting 6272 pages (=24.5MB)
  means wasting about 6 cents. That's a good deal for getting good performance.

Example3: ultimately big server
  CPU: 2048
  MEM: 64GBx256 (=16TB)

  old) zone->min = sqrt(16x1024x1024x1024x16)/256 = 2048 KB = 512 pages
                                      (Wow, it's smaller than the desktop case)
  new) max-drift = 125 x 2048 = 256000 pages = 1000MB (greater than 64GB/100)
       zone->min = 512 pages

Reported-by: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 53627fa..194bdaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4897,6 +4897,15 @@ static void setup_per_zone_wmarks(void)
 	for_each_zone(zone) {
 		u64 tmp;
 
+		/*
+		 * If max drift are less than 1%, reserve max drift pages
+		 * instead costly runtime calculation.
+		 */
+		if (zone->percpu_drift_mark < (zone->present_pages/100)) {
+			pages_min += zone->percpu_drift_mark;
+			zone->percpu_drift_mark = 0;
+		}
+
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur
  2010-10-13  6:27               ` [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur KOSAKI Motohiro
@ 2010-10-13  6:39                 ` KAMEZAWA Hiroyuki
  2010-10-13 12:59                 ` Mel Gorman
  1 sibling, 0 replies; 65+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-13  6:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

On Wed, 13 Oct 2010 15:27:12 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Currently, memory hotplu call setup_per_zone_wmarks() and
> calculate_zone_inactive_ratio(), but don't call setup_per_zone_lowmem_reserve.
> 
> It mean number of reserved pages aren't updated even if memory hot plug
> occur. This patch fixes it.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 2/3] mm: update pcp->stat_threshold when memory hotplug occur
  2010-10-13  6:28               ` [RFC][PATCH 2/3] mm: update pcp->stat_threshold " KOSAKI Motohiro
@ 2010-10-13  6:40                 ` KAMEZAWA Hiroyuki
  2010-10-13 13:02                 ` Mel Gorman
  1 sibling, 0 replies; 65+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-13  6:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

On Wed, 13 Oct 2010 15:28:14 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Currently, cpu hotplug updates pcp->stat_threashold, but memory
> hotplug doesn't. there is no reason.
> 
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
                                 ` (2 preceding siblings ...)
  2010-10-13  6:32               ` [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
@ 2010-10-13  7:10               ` KOSAKI Motohiro
  2010-10-13  7:16                 ` KAMEZAWA Hiroyuki
                                   ` (2 more replies)
  3 siblings, 3 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-13  7:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

Under memory shortage, we use drain_pages() to flush the per-cpu
page lists. In this case, the per-cpu stats should be flushed too, because
we are under memory shortage and need to know the exact number of free pages.

Otherwise get_page_from_freelist() may fail even though the pcp lists were
flushed.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/vmstat.h |    5 +++++
 mm/page_alloc.c        |    1 +
 mm/vmstat.c            |   12 ++++++++++++
 3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1997988..df777f4 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -253,6 +253,8 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void __flush_zone_state(struct zone *zone, enum zone_stat_item item);
+
 void refresh_cpu_vm_stats(int);
 void refresh_zone_stat_thresholds(void);
 #else /* CONFIG_SMP */
@@ -299,6 +301,9 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+static inline void
+__flush_zone_state(struct zone *zone, enum zone_stat_item item) { }
+
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 194bdaa..8b50e52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1093,6 +1093,7 @@ static void drain_pages(unsigned int cpu)
 		pcp = &pset->pcp;
 		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
+		__flush_zone_state(zone, NR_FREE_PAGES);
 		local_irq_restore(flags);
 	}
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 48b0463..1ca04ec 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -233,6 +233,18 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __flush_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+	s8 *diff = pcp->vm_stat_diff + item;
+
+	if (!*diff)
+		return;
+
+	zone_page_state_add(*diff, zone, item);
+	*diff = 0;
+}
+
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__inc_zone_state(page_zone(page), item);
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-13  7:10               ` [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed KOSAKI Motohiro
@ 2010-10-13  7:16                 ` KAMEZAWA Hiroyuki
  2010-10-13 13:22                 ` Mel Gorman
  2010-10-18 15:51                 ` Christoph Lameter
  2 siblings, 0 replies; 65+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-13  7:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

On Wed, 13 Oct 2010 16:10:43 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> When memory shortage, we are using drain_pages() for flushing per cpu
> page cache. In this case, per cpu stat should be flushed too. because
> now we are under memory shortage and we need to know exact free pages.
> 
> Otherwise get_page_from_freelist() may fail even though pcp was flushed.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

But it seems a performance measurement is necessary.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-13  3:36           ` KOSAKI Motohiro
  2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
@ 2010-10-13 11:24             ` Mel Gorman
  2010-10-14  3:07               ` KOSAKI Motohiro
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-13 11:24 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

On Wed, Oct 13, 2010 at 12:36:42PM +0900, KOSAKI Motohiro wrote:
> Hello, 
> 
> > ==== CUT HERE ====
> > mm: page allocator: Adjust the per-cpu counter threshold when memory is low
> > 
> > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
> > memory is low] noted that watermarks were based on the vmstat
> > NR_FREE_PAGES. To avoid synchronization overhead, these counters are
> > maintained on a per-cpu basis and drained both periodically and when a
> > threshold is above a threshold. On large CPU systems, the difference
> > between the estimate and real value of NR_FREE_PAGES can be very high.
> > The system can get into a case where pages are allocated far below the
> > min watermark potentially causing livelock issues. The commit solved the
> > problem by taking a better reading of NR_FREE_PAGES when memory was low.
> > 
> > Unfortately, as reported by Shaohua Li this accurate reading can consume
> > a large amount of CPU time on systems with many sockets due to cache
> > line bouncing. This patch takes a different approach. For large machines
> > where counter drift might be unsafe and while kswapd is awake, the per-cpu
> > thresholds for the target pgdat are reduced to limit the level of drift
> > to what should be a safe level. This incurs a performance penalty in heavy
> > memory pressure by a factor that depends on the workload and the machine but
> > the machine should function correctly without accidentally exhausting all
> > memory on a node. There is an additional cost when kswapd wakes and sleeps
> > but the event is not expected to be frequent - in Shaohua's test case,
> > there was one recorded sleep and wake event at least.
> 
> Interesting. I've reveiwed this one.
> 

Thanks

> > Reported-by: Shaohua Li <shaohua.li@intel.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/mmzone.h |    6 ------
> >  include/linux/vmstat.h |    2 ++
> >  mm/mmzone.c            |   21 ---------------------
> >  mm/page_alloc.c        |    4 ++--
> >  mm/vmscan.c            |    2 ++
> >  mm/vmstat.c            |   42 +++++++++++++++++++++++++++++++++++++++++-
> >  6 files changed, 47 insertions(+), 30 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 3984c4e..343fd5c 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
> >  	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
> >  }
> >  
> > -#ifdef CONFIG_SMP
> > -unsigned long zone_nr_free_pages(struct zone *zone);
> > -#else
> > -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
> > -#endif /* CONFIG_SMP */
> > -
> >  /*
> >   * The "priority" of VM scanning is how much of the queues we will scan in one
> >   * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index eaaea37..c67d333 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
> >  extern void __dec_zone_state(struct zone *, enum zone_stat_item);
> >  
> >  void refresh_cpu_vm_stats(int);
> > +void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
> > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
> >  #else /* CONFIG_SMP */
> >  
> >  /*
> > diff --git a/mm/mmzone.c b/mm/mmzone.c
> > index e35bfb8..f5b7d17 100644
> > --- a/mm/mmzone.c
> > +++ b/mm/mmzone.c
> > @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
> >  	return 1;
> >  }
> >  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> > -
> > -#ifdef CONFIG_SMP
> > -/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > -unsigned long zone_nr_free_pages(struct zone *zone)
> > -{
> > -	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > -
> > -	/*
> > -	 * While kswapd is awake, it is considered the zone is under some
> > -	 * memory pressure. Under pressure, there is a risk that
> > -	 * per-cpu-counter-drift will allow the min watermark to be breached
> > -	 * potentially causing a live-lock. While kswapd is awake and
> > -	 * free pages are low, get a better estimate for free pages
> > -	 */
> > -	if (nr_free_pages < zone->percpu_drift_mark &&
> > -			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> > -		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
> 
> Now, we can remove zone_page_state_snapshot() too.
> 

Yes, we can.

> > -
> > -	return nr_free_pages;
> > -}
> > -#endif /* CONFIG_SMP */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a8cfa9c..a9b4542 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >  {
> >  	/* free_pages my go negative - that's OK */
> >  	long min = mark;
> > -	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
> > +	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
> >  	int o;
> >  
> >  	if (alloc_flags & ALLOC_HIGH)
> > @@ -2436,7 +2436,7 @@ void show_free_areas(void)
> >  			" all_unreclaimable? %s"
> >  			"\n",
> >  			zone->name,
> > -			K(zone_nr_free_pages(zone)),
> > +			K(zone_page_state(zone, NR_FREE_PAGES)),
> >  			K(min_wmark_pages(zone)),
> >  			K(low_wmark_pages(zone)),
> >  			K(high_wmark_pages(zone)),
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c5dfabf..47ba29e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
> >  				 */
> >  				if (!sleeping_prematurely(pgdat, order, remaining)) {
> >  					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +					enable_pgdat_percpu_threshold(pgdat);
> >  					schedule();
> > +					disable_pgdat_percpu_threshold(pgdat);
> 
> If we have 4096 cpus, max drift = 125x4096x4096 ~= 2GB. It is higher than zone watermark.
> Then, such a system can exhaust memory before kswapd calls disable_pgdat_percpu_threshold().
> 

I don't *think* so but lets explore that possibility. For this to occur, all
CPUs would have to be allocating all of their memory from the one node (4096
CPUs is not going to be UMA) which is not going to happen. But allocations
from one node could be falling over to others of course.

Lets take an early condition that has to occur for a 4096 CPU machine to
get into trouble - node 0 exhausted and moving to node 1 and counter drift
makes us think everything is fine.

__alloc_pages_nodemask
  -> get_page_from_freelist
    -> zone_watermark_ok == true (because we are drifting)
    -> buffered_rmqueue
      -> __rmqueue (fails eventually, no pages despite watermark_ok)
  -> __alloc_pages_slowpath
    -> wake_all_kswapd()
...
kswapd wakes
  -> disable_pgdat_percpu_threshold()

i.e. as each node becomes exhausted in reality, kswapd will wake up, disable
the thresholds until the high watermark is back and go back to sleep. I'm
not seeing how we'd get into a situation where all kswapds are asleep at the
same time while each allocator allocates all of memory without managing to
wake kswapd. Even GFP_ATOMIC allocations will wakeup kswapd.

Hence, I think the current patch of disabling thresholds while kswapd is
awake to be sufficient to avoid livelock due to memory exhaustion and
counter drift.

> Hmmm....
> This seems like a fundamental problem. Currently our zone watermark and per-cpu stat
> threshold have completely unbalanced definitions.
> 
> zone watermark:            very few (a few megabytes)
>                                        proportional to sqrt(mem)
>                                        not proportional to nr-cpus
> 
> per-cpu stat threshold:  relatively large (desktop: a few megabytes, server ~50MB, SGI 2GB ;-)
>                                        proportional to log(mem)
>                                        proportional to log(nr-cpus)
> 
> It means that many cpus break the watermark assumption.....
> 

They are for different things. watermarks are meant to prevent livelock
due to memory exhaustion. per-cpu thresholds are so that counters have
acceptable performance. The assumptions of watermarks remain the same
but we have to correctly handle when counter drift can break watermarks.
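
To put numbers on the mismatch: with the per-cpu threshold capped at 125 (as
in calculate_threshold()), the worst-case drift is roughly

	max_drift ~= stat_threshold * num_online_cpus
	          ~= 125 * 64   =   8000 pages (~31MB) on Shaohua's 64-cpu box
	          ~= 125 * 4096 = 512000 pages (~2GB)  on the 4096-cpu example

while the min->low gap on Shaohua's Normal zone is only 1801 - 1441 = 360
pages, so the potential drift dwarfs the gap the watermarks rely on.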

> 
> 
> >  				} else {
> >  					if (remaining)
> >  						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 355a9e6..19bd4a1 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
> > +				num_online_cpus())));
> > +}
> > +
> >  static int calculate_threshold(struct zone *zone)
> >  {
> >  	int threshold;
> > @@ -159,6 +165,40 @@ static void refresh_zone_stat_thresholds(void)
> >  	}
> >  }
> >  
> > +void disable_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +
> > +	for_each_populated_zone(zone) {
> > +		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
> > +			continue;
> 
> We don't use the for_each_populated_zone() and "zone->zone_pgdat != pgdat" combination
> in almost any other place.
> 
>         for (i = 0; i < pgdat->nr_zones; i++) {
> 
> is enough?
> 

Yes, it would be

> > +
> > +		threshold = calculate_pressure_threshold(zone);
> > +		for_each_online_cpu(cpu)
> > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > +							= threshold;
> >
> > +	}
> > +}
> > +
> > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +
> > +	for_each_populated_zone(zone) {
> > +		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
> > +			continue;
> > +
> > +		threshold = calculate_threshold(zone);
> > +		for_each_online_cpu(cpu)
> > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > +							= threshold;
> > +	}
> > +}
> 
> disable_pgdat_percpu_threshold() and enable_pgdat_percpu_threshold() are
> almost the same. Can you merge them?
> 

I wondered the same but as thresholds are calculated per-zone, I didn't see
how that could be handled in a unified function without using a callback
function pointer. If I used callback functions and an additional boolean, I
could merge refresh_zone_stat_thresholds(), disable_pgdat_percpu_threshold()
and enable_pgdat_percpu_threshold() but I worried the end-result would be
a bit unreadable and hinder review. I could roll a standalone patch that
merges the three if we end up agreeing on this patch's general approach
to counter drift.
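
For illustration, the merged helper would look something like this (sketch
only, not even compile-tested), taking the per-zone calculation as a callback:

void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				int (*calculate)(struct zone *zone))
{
	struct zone *zone;
	int cpu;
	int threshold;
	int i;

	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];
		if (!zone->percpu_drift_mark)
			continue;

		threshold = calculate(zone);
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
}

with disable_pgdat_percpu_threshold() passing calculate_pressure_threshold
and enable_pgdat_percpu_threshold() passing calculate_threshold.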

> 
> > +
> >  /*
> >   * For use when we know that interrupts are disabled.
> >   */
> > @@ -826,7 +866,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> >  		   "\n        scanned  %lu"
> >  		   "\n        spanned  %lu"
> >  		   "\n        present  %lu",
> > -		   zone_nr_free_pages(zone),
> > +		   zone_page_state(zone, NR_FREE_PAGES),
> >  		   min_wmark_pages(zone),
> >  		   low_wmark_pages(zone),
> >  		   high_wmark_pages(zone),
> > 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-13  2:41           ` Shaohua Li
@ 2010-10-13 12:09             ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-13 12:09 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, cl, Andrew Morton, David Rientjes, KOSAKI Motohiro

On Wed, Oct 13, 2010 at 10:41:36AM +0800, Shaohua Li wrote:
> On Wed, Oct 13, 2010 at 12:25:26AM +0800, Mel Gorman wrote:
> > > > > > > In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10% cpu time
> > > > > > > according to perf when memory pressure is high. The workload does something
> > > > > > > like:
> > > > > > > for i in `seq 1 $nr_cpu`
> > > > > > > do
> > > > > > >         create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > > > > > >         $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > > > > > done
> > > > > > > this simply reads a sparse file for each CPU. Apparently the
> > > > > > > zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot() makes
> > > > > > > a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for reference.
> > > > > > 
> > > > > > Would it be possible for you to post the oprofile report? I'm in the
> > > > > > early stages of trying to reproduce this locally based on your test
> > > > > > description. The first machine I tried showed that zone_nr_page_state
> > > > > > was consuming 0.26% of profile time with the vast bulk occupied by
> > > > > > do_mpage_readahead. See as follows
> > > > > > 
> > > > > > 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift do_mpage_readpage
> > > > > > 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift __isolate_lru_page
> > > > > > 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift free_pcppages_bulk
> > > > > > 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift __rmqueue
> > > > > > 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift native_flush_tlb_others
> > > > > > 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift unlock_page
> > > > > > 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift get_page_from_freelist
> > > > > > 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift __alloc_pages_nodemask
> > > > > > 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift ____pagevec_lru_add
> > > > > > 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift radix_tree_delete
> > > > > > 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift shrink_page_list
> > > > > > 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift zone_watermark_ok
> > > > > > .....
> > > > > > 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift zone_nr_free_pages
> > > > > > 
> > > > > > The machine I am testing on is non-NUMA 4-core single socket and totally
> > > > > > different characteristics but I want to be sure I'm going more or less the
> > > > > > right direction with the reproduction case before trying to find a larger
> > > > > > machine.
> > > > > 
> > > > > Here it is. This is a 4 socket Nehalem machine.
> > > > >            268160.00 57.2% _raw_spin_lock                      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > >             40302.00  8.6% zone_nr_free_pages                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > >             36827.00  7.9% do_mpage_readpage                   /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > >             28011.00  6.0% _raw_spin_lock_irq                  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > >             22973.00  4.9% flush_tlb_others_ipi                /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > >             10713.00  2.3% smp_invalidate_interrupt            /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > > > 
> > > > <SNIP>
> > > >
> > > Basically the similar test. I'm using Fengguang's test, please check attached
> > > file. I didn't enable lock stat or debug. The difference is my test is under a
> > > 4 socket system. In a 1 socket system, I don't see the issue too.
> > > 
> > 
> > Ok, finding a large enough machine was key here true enough. I don't
> > have access to Nehalem boxes but the same problem showed up on a large
> > ppc64 machine (8 socket, interestingly enough a 3 socket did not have any
> > significant problem).  Based on that, I reproduced the problem and came up
> > with the patch below.
> > 
> > Christoph, can you look at this please? I know you had concerns about adjusting
> > thresholds as being an expensive operation but the patch limits how often it
> > occurs and it seems better than reducing thresholds for the full lifetime of
> > the system just to avoid counter drift. What I did find with the patch that
> > the overhead of __mod_zone_page_state() increases because of the temporarily
> > reduced threshold. It goes from 0.0403% of profile time to 0.0967% on one
> > machine and from 0.0677% to 0.43% on another. As this is just while kswapd
> > is awake, it seems within an acceptable margin but it is a caution against
> > simply reducing the existing thresholds. What is more relevant is the time
> > to complete the benchmark is increased due to the reduction of the thresholds.
> > This is a tradeoff between being fast and safe but I'm open to
> > suggestions on how high a safe threshold might be.
> > 
> > Shaohua, can you test keeping an eye out for any additional function
> > that is now taking a lot more CPU time?
>
> Seems ok so far on the 4 socket system. In this system, each node has 8G of
> memory, so the threshold is 5 under memory pressure. I haven't tested this
> on smaller machines yet, for example where each node has just 4G of memory.
> 

Please do. I tested on 3 small UMA machines myself as well as the two large
machines but spotted no problems.  This is somewhat expected as the counter
drift problem only really applies to larger machines. I've included a V2
below based on Kosaki's feedback but functionally it should be identical.

==== CUT HERE ====
mm: page allocator: Adjust the per-cpu counter threshold when memory is low V2

Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
memory is low] noted that watermarks were based on the vmstat
NR_FREE_PAGES. To avoid synchronization overhead, these counters are
maintained on a per-cpu basis and drained both periodically and when a
per-cpu delta exceeds a threshold. On large CPU systems, the difference
between the estimate and real value of NR_FREE_PAGES can be very high.
The system can get into a case where pages are allocated far below the
min watermark potentially causing livelock issues. The commit solved the
problem by taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can consume
a large amount of CPU time on systems with many sockets due to cache
line bouncing. This patch takes a different approach. For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift
to what should be a safe level. This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine but
the machine should function correctly without accidentally exhausting all
memory on a node. There is an additional cost when kswapd wakes and sleeps
but the event is not expected to be frequent - in Shaohua's test case at
least, there was one recorded sleep and wake event.

Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    6 ------
 include/linux/vmstat.h |   24 ++----------------------
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |    4 ++--
 mm/vmscan.c            |    2 ++
 mm/vmstat.c            |   46 +++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 51 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..343fd5c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..a36e65b 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -170,28 +170,6 @@ static inline unsigned long zone_page_state(struct zone *zone,
 	return x;
 }
 
-/*
- * More accurate version that also considers the currently pending
- * deltas. For that we need to loop over all cpus to find the current
- * deltas. There is no synchronization so the result cannot be
- * exactly accurate either.
- */
-static inline unsigned long zone_page_state_snapshot(struct zone *zone,
-					enum zone_stat_item item)
-{
-	long x = atomic_long_read(&zone->vm_stat[item]);
-
-#ifdef CONFIG_SMP
-	int cpu;
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
-
-	if (x < 0)
-		x = 0;
-#endif
-	return x;
-}
-
 extern unsigned long global_reclaimable_pages(void);
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
@@ -254,6 +232,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
+void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8cfa9c..a9b4542 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
+	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2436,7 +2436,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..47ba29e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2378,7 +2378,9 @@ static int kswapd(void *p)
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+					enable_pgdat_percpu_threshold(pgdat);
 					schedule();
+					disable_pgdat_percpu_threshold(pgdat);
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 355a9e6..ddee139 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
+				num_online_cpus())));
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void disable_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
+void enable_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +870,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur
  2010-10-13  6:27               ` [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur KOSAKI Motohiro
  2010-10-13  6:39                 ` KAMEZAWA Hiroyuki
@ 2010-10-13 12:59                 ` Mel Gorman
  2010-10-14  2:44                   ` KOSAKI Motohiro
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-13 12:59 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

On Wed, Oct 13, 2010 at 03:27:12PM +0900, KOSAKI Motohiro wrote:
> Currently, memory hotplug calls setup_per_zone_wmarks() and
> calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve().
> 
> It means the number of reserved pages isn't updated even if memory hotplug
> occurs. This patch fixes it.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Ok, I see the logic although the changelog needs a better description as
to why this matters and what the consequences are. It appears unrelated
to Shaohua's problem for example. Otherwise the patch looks reasonable

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 2/3] mm: update pcp->stat_threshold when memory hotplug occur
  2010-10-13  6:28               ` [RFC][PATCH 2/3] mm: update pcp->stat_threshold " KOSAKI Motohiro
  2010-10-13  6:40                 ` KAMEZAWA Hiroyuki
@ 2010-10-13 13:02                 ` Mel Gorman
  1 sibling, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-13 13:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Wed, Oct 13, 2010 at 03:28:14PM +0900, KOSAKI Motohiro wrote:
> Currently, cpu hotplug updates pcp->stat_threshold, but memory
> hotplug doesn't. There is no reason for this.
> 
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Again, this patch seems reasonable but unrelated to Shaohua's problem
except in the specific case where a memory hotplug operation changes
the point at which per-cpu drift becomes a problem.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot()
  2010-10-13  6:32               ` [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
@ 2010-10-13 13:19                 ` Mel Gorman
  2010-10-14  2:39                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-13 13:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Wed, Oct 13, 2010 at 03:32:08PM +0900, KOSAKI Motohiro wrote:
> Shaohua Li reported that commit aa45484031 (mm: page allocator: calculate a
> better estimate of NR_FREE_PAGES when memory is low and kswapd is awake)
> caused a performance regression.
> 
> | In a 4 socket 64 CPU system, zone_nr_free_pages() takes about 5% ~ 10%
> | cpu time
> | according to perf when memory pressure is high. The workload does
> | something like:
> | for i in `seq 1 $nr_cpu`
> | do
> |        create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> |        $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> | done
> | this simply reads a sparse file for each CPU. Apparently the
> | zone->percpu_drift_mark is too big, and guess zone_page_state_snapshot()
> | makes a lot of cache bounce for ->vm_stat_diff[]. below is the zoneinfo for
> | reference.
> | Is there any way to reduce the overhead?
> |
> | Node 3, zone   Normal
> | pages free     2055926
> |         min      1441
> |         low      1801
> |         high     2161
> |         scanned  0
> |         spanned  2097152
> |         present  2068480
> |   vm stats threshold: 98
> 
> It means zone_page_state_snapshot() is costlier than we expected.

Yes, although remember *why* it's expensive. It's not the iteration of
the loop per se. It's the bouncing of cache lines between cores on
different sockets because it's heavily dirtied data. That's why my patch
took the approach of reducing the thresholds so that local CPUs would
sync more frequently but it wouldn't be bouncing as much dirty data
between sockets.
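
For reference, the expensive part is this loop in zone_page_state_snapshot()
(the code my patch removes):

	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

Each iteration reads a vm_stat_diff entry that its owning CPU is busy
dirtying, so under pressure every read is effectively a cross-socket cache
line transfer, repeated once per online cpu per watermark check.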

> This
> patch introduces a very different approach: we reserve max-drift pages
> up front instead of doing the runtime free page calculation.
> 
> But this technique can't be used on systems with many cpus and little memory.
> On such systems, we still need to use zone_page_state_snapshot().
> 
> Example1: typical desktop
>   CPU: 2
>   MEM: 2GB
> 
>   old) zone->min = sqrt(2x1024x1024x16) = 5792 KB = 1448 pages
>   new) max-drift = 2 x log2(2) x log2(2x1024/128) x 2 = 40
>        zone->min = 1448 + 40 = 1488 pages
> 
> Example2: relatively large server
>   CPU: 64
>   MEM: 8GBx4 (=32GB)
> 
>   old) zone->min = sqrt(32x1024x1024x16)/4 = 5792 KB = 1448 pages
>   new) max-drift = 2 x log2(64) x log2(8x1024/128) x 64 = 6272 pages
>        zone->min = 1448 + 6272 = 7720 pages
> 
>   Hmm, zone->min became almost 5x times. Is it acceptable? I think yes.

Yes, I think that's acceptable. The level needed for the best fragmentation
avoidance is higher than that for systems that have varying requirements
for the number of huge pages.

>   Today, we can buy 8GB of DRAM for $20. So wasting 6272 pages (=24.5MB)
>   means wasting about 6 cents. It's a good deal for getting good performance.
> 
> Example3: ultimately big server
>   CPU: 2048
>   MEM: 64GBx256 (=16TB)
> 
>   old) zone->min = sqrt(16x1024x1024x1024x16)/256 = 2048 KB = 512 pages
>                                       (Wow!, it's smaller than desktop)
>   new) max-drift = 125 x 2048 = 256000 pages = 1000MB (greater than 64GB/100)
>        zone->min = 512 pages
> 
> Reported-by: Shaohua Li <shaohua.li@intel.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 53627fa..194bdaa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4897,6 +4897,15 @@ static void setup_per_zone_wmarks(void)
>  	for_each_zone(zone) {
>  		u64 tmp;
>  
> +		/*
> +		 * If max drift are less than 1%, reserve max drift pages
> +		 * instead costly runtime calculation.
> +		 */
> +		if (zone->percpu_drift_mark < (zone->present_pages/100)) {
> +			pages_min += zone->percpu_drift_mark;
> +			zone->percpu_drift_mark = 0;
> +		}
> +

I don't see how this solves Shaohua's problem as such. Large systems will
still suffer a big performance penalty from zone_page_state_snapshot(). I
do see the logic of adjusting min for larger systems to limit the amount of
time per-cpu thresholds are lowered, but that would be a follow-on to my
patch rather than a replacement.

>  		spin_lock_irqsave(&zone->lock, flags);
>  		tmp = (u64)pages_min * zone->present_pages;
>  		do_div(tmp, lowmem_pages);
> -- 
> 1.6.5.2
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-13  7:10               ` [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed KOSAKI Motohiro
  2010-10-13  7:16                 ` KAMEZAWA Hiroyuki
@ 2010-10-13 13:22                 ` Mel Gorman
  2010-10-14  2:50                   ` KOSAKI Motohiro
  2010-10-18 15:51                 ` Christoph Lameter
  2 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-13 13:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Wed, Oct 13, 2010 at 04:10:43PM +0900, KOSAKI Motohiro wrote:
> Under memory shortage, we use drain_pages() to flush the per cpu
> page cache. In this case, the per cpu stats should be flushed too, because
> we are under memory shortage and need to know the exact number of free pages.
> 
> Otherwise get_page_from_freelist() may fail even though the pcp was flushed.
> 

With my patch adjusting the threshold to a small value while kswapd is awake,
it seems less necessary. It's also very hard to predict the performance of
this. We are certainly going to take a hit to do the flush but we *might*
gain slightly if an allocation succeeds because a watermark check passed
when the counters were updated. It's a definite hit for a possible gain
though which is not a great trade-off. Would need some performance testing.

I still think my patch on adjusting thresholds is our best proposal so
far on how to reduce Shaohua's performance problems while still being
safer from livelocks due to memory exhaustion.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot()
  2010-10-13 13:19                 ` Mel Gorman
@ 2010-10-14  2:39                   ` KOSAKI Motohiro
  2010-10-18 10:43                     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-14  2:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 53627fa..194bdaa 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4897,6 +4897,15 @@ static void setup_per_zone_wmarks(void)
> >  	for_each_zone(zone) {
> >  		u64 tmp;
> >  
> > +		/*
> > +		 * If max drift are less than 1%, reserve max drift pages
> > +		 * instead costly runtime calculation.
> > +		 */
> > +		if (zone->percpu_drift_mark < (zone->present_pages/100)) {
> > +			pages_min += zone->percpu_drift_mark;
> > +			zone->percpu_drift_mark = 0;
> > +		}
> > +
> 
> I don't see how this solves Shaohua's problem as such. Large systems will
> still suffer a bug performance penalty from zone_page_state_snapshot(). I
> do see the logic of adjusting min for larger systems to limit the amount of
> time per-cpu thresholds are lowered but that would be as a follow-on to my
> patch rather than a replacement.

My patch rescues systems with 256 cpus or fewer. And I assumed 4096-cpu systems
don't run IO intensive workloads such as Shaohua's case; they always use cpusets
and run HPC workloads.

If you know of another >1024-cpu system, please let me know.
And again, my patch works on a 4096-cpu system, although slowly, but yours doesn't.

Am I missing something?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur
  2010-10-13 12:59                 ` Mel Gorman
@ 2010-10-14  2:44                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-14  2:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes

> On Wed, Oct 13, 2010 at 03:27:12PM +0900, KOSAKI Motohiro wrote:
> > Currently, memory hotplu call setup_per_zone_wmarks() and
> > calculate_zone_inactive_ratio(), but don't call setup_per_zone_lowmem_reserve.
> > 
> > It mean number of reserved pages aren't updated even if memory hot plug
> > occur. This patch fixes it.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Ok, I see the logic although the changelog needs a better description as
> to why this matters and what the consequences are. It appears unrelated
> to Shaohua's problem for example. Otherwise the patch looks reasonable
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>

Patches [1/3] and [2/3] are necessary to avoid [3/3] breaking memory hotplug.
When memory hotplug occurs, we need to update _all_ zone->present_pages related
vm parameters; otherwise they might become inconsistent and unworkable.

In more detail: [3/3] depends on setup_per_zone_wmarks() always being called after
refresh_zone_stat_thresholds(), which patches [1/3] and [2/3] ensure.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-13 13:22                 ` Mel Gorman
@ 2010-10-14  2:50                   ` KOSAKI Motohiro
  2010-10-15 17:31                     ` Christoph Lameter
  2010-10-18 11:08                     ` Mel Gorman
  0 siblings, 2 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-14  2:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> On Wed, Oct 13, 2010 at 04:10:43PM +0900, KOSAKI Motohiro wrote:
> > When memory shortage, we are using drain_pages() for flushing per cpu
> > page cache. In this case, per cpu stat should be flushed too. because
> > now we are under memory shortage and we need to know exact free pages.
> > 
> > Otherwise get_page_from_freelist() may fail even though pcp was flushed.
> > 
> 
> With my patch adjusting the threshold to a small value while kswapd is awake,
> it seems less necessary. 

I agree with this.

> It's also very hard to predict the performance of
> this. We are certainly going to take a hit to do the flush but we *might*
> gain slightly if an allocation succeeds because a watermark check passed
> when the counters were updated. It's a definite hit for a possible gain
> though which is not a great trade-off. Would need some performance testing.
> 
> I still think my patch on adjusting thresholds is our best proposal so
> far on how to reduce Shaohua's performance problems while still being
> safer from livelocks due to memory exhaustion.

OK, I will try to explain my worry in detail.

Initial variable ZVC commit (df9ecaba3f1) says 

>     [PATCH] ZVC: Scale thresholds depending on the size of the system
> 
>     The ZVC counter update threshold is currently set to a fixed value of 32.
>     This patch sets up the threshold depending on the number of processors and
>     the sizes of the zones in the system.
> 
>     With the current threshold of 32, I was able to observe slight contention
>     when more than 130-140 processors concurrently updated the counters.  The
>     contention vanished when I either increased the threshold to 64 or used
>     Andrew's idea of overstepping the interval (see ZVC overstep patch).
> 
>     However, we saw contention again at 220-230 processors.  So we need higher
>     values for larger systems.

So, I'm worried that your patch reintroduces the old cache contention issue that Christoph
observed when running 128-256 cpu systems.  May I ask what you think about this issue?
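
To put a number on it: with calculate_pressure_threshold() (taking it as the
intended (high_wmark - low_wmark) / num_online_cpus), Shaohua's Normal zone
with high=2161, low=1801 and 64 cpus gets

	(2161 - 1801) / 64 ~= 5 pages

per cpu while kswapd is awake - which matches the threshold of 5 Shaohua
reported, and is well below the fixed 32 that already showed contention at
130-140 processors in the commit quoted above.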




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-13 11:24             ` zone state overhead Mel Gorman
@ 2010-10-14  3:07               ` KOSAKI Motohiro
  2010-10-18 10:39                 ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-14  3:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

Hi

> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index c5dfabf..47ba29e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
> > >  				 */
> > >  				if (!sleeping_prematurely(pgdat, order, remaining)) {
> > >  					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > > +					enable_pgdat_percpu_threshold(pgdat);
> > >  					schedule();
> > > +					disable_pgdat_percpu_threshold(pgdat);
> > 
> > If we have 4096 cpus, max drift = 125x4096x4096 ~= 2GB. It is higher than zone watermark.
> > Then, such a system can exhaust memory before kswapd calls disable_pgdat_percpu_threshold().
> > 
> 
> I don't *think* so but lets explore that possibility. For this to occur, all
> CPUs would have to be allocating all of their memory from the one node (4096
> CPUs is not going to be UMA) which is not going to happen. But allocations
> from one node could be falling over to others of course.
> 
> Lets take an early condition that has to occur for a 4096 CPU machine to
> get into trouble - node 0 exhausted and moving to node 1 and counter drift
> makes us think everything is fine.
> 
> __alloc_pages_nodemask
>   -> get_page_from_freelist
>     -> zone_watermark_ok == true (because we are drifting)
>     -> buffered_rmqueue
>       -> __rmqueue (fails eventually, no pages despite watermark_ok)
>   -> __alloc_pages_slowpath
>     -> wake_all_kswapd()
> ...
>
> kswapd wakes
>   -> disable_pgdat_percpu_threshold()
> 
> i.e. as each node becomes exhausted in reality, kswapd will wake up, disable
> the thresholds until the high watermark is back and go back to sleep. I'm
> not seeing how we'd get into a situation where all kswapds are asleep at the
> same time while each allocator allocates all of memory without managing to
> wake kswapd. Even GFP_ATOMIC allocations will wakeup kswapd.
> 
> Hence, I think the current patch of disabling thresholds while kswapd is
> awake to be sufficient to avoid livelock due to memory exhaustion and
> counter drift.
> 

In this case, wakeup_kswapd() doesn't wake kswapd, because

---------------------------------------------------------------------------------
void wakeup_kswapd(struct zone *zone, int order)
{
        pg_data_t *pgdat;

        if (!populated_zone(zone))
                return;

        pgdat = zone->zone_pgdat;
        if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
                return;                          // HERE
---------------------------------------------------------------------------------

So, if we take your approach, we need to know the exact number of free pages here.
But zone_page_state_snapshot() is slow. That's the dilemma.



> > Hmmm....
> > This seems like a fundamental problem. Currently our zone watermark and per-cpu stat
> > threshold have completely unbalanced definitions.
> > 
> > zone watermark:            very few (a few megabytes)
> >                                        proportional to sqrt(mem)
> >                                        not proportional to nr-cpus
> > 
> > per-cpu stat threshold:  relatively large (desktop: a few megabytes, server ~50MB, SGI 2GB ;-)
> >                                        proportional to log(mem)
> >                                        proportional to log(nr-cpus)
> > 
> > It means that many cpus break the watermark assumption.....
> > 
> 
> They are for different things. watermarks are meant to prevent livelock
> due to memory exhaustion. per-cpu thresholds are so that counters have
> acceptable performance. The assumptions of watermarks remain the same
> but we have to correctly handle when counter drift can break watermarks.

ok.



> > > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat)
> > > +{
> > > +	struct zone *zone;
> > > +	int cpu;
> > > +	int threshold;
> > > +
> > > +	for_each_populated_zone(zone) {
> > > +		if (!zone->percpu_drift_mark || zone->zone_pgdat != pgdat)
> > > +			continue;
> > > +
> > > +		threshold = calculate_threshold(zone);
> > > +		for_each_online_cpu(cpu)
> > > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > > +							= threshold;
> > > +	}
> > > +}
> > 
> > disable_pgdat_percpu_threshold() and enable_pgdat_percpu_threshold() are
> > almostly same. can you merge them?
> > 
> 
> I wondered the same but as thresholds are calculated per-zone, I didn't see
> how that could be handled in a unified function without using a callback
> function pointer. If I used callback functions and an additional boolean, I
> could merge refresh_zone_stat_thresholds(), disable_pgdat_percpu_threshold()
> and enable_pgdat_percpu_threshold() but I worried the end-result would be
> a bit unreadable and hinder review. I could roll a standalone patch that
> merges the three if we end up agreeing on this patches general approach
> to counter drift.

ok, I think you are right.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-14  2:50                   ` KOSAKI Motohiro
@ 2010-10-15 17:31                     ` Christoph Lameter
  2010-10-18  9:27                       ` KOSAKI Motohiro
  2010-10-18 11:08                     ` Mel Gorman
  1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-10-15 17:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Thu, 14 Oct 2010, KOSAKI Motohiro wrote:

> Initial variable ZVC commit (df9ecaba3f1) says
>
> >     [PATCH] ZVC: Scale thresholds depending on the size of the system
> >
> >     The ZVC counter update threshold is currently set to a fixed value of 32.
> >     This patch sets up the threshold depending on the number of processors and
> >     the sizes of the zones in the system.
> >
> >     With the current threshold of 32, I was able to observe slight contention
> >     when more than 130-140 processors concurrently updated the counters.  The
> >     contention vanished when I either increased the threshold to 64 or used
> >     Andrew's idea of overstepping the interval (see ZVC overstep patch).
> >
> >     However, we saw contention again at 220-230 processors.  So we need higher
> >     values for larger systems.
>
> So, I'm worry about your patch reintroduce old cache contention issue that Christoph
> observed when run 128-256cpus system.  May I ask how do you think this issue?

The load that I ran with was a test that concurrently faulted pages on a
large number of processors. This is a bit artificial and is only of
performance concern during startup of a large HPC job. The frequency of
counter updates during regular processing should pose a much lighter load
on the system. The automatic adaptation of the thresholds should:

1. Preserve the initial startup performance (since the threshold will be
unmodified on a system just starting).

2. Reduce the overhead of establishing a more accurate zone state (because
reclaim can then cause the thresholds to be adapted).
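
For reference, the thresholds in question scale only logarithmically with
cpus and memory (roughly, from calculate_threshold() - constants from memory,
though they do reproduce the "vm stats threshold: 98" in Shaohua's zoneinfo):

	/* mem is the zone size in 128MB units */
	mem = zone->present_pages >> (27 - PAGE_SHIFT);
	threshold = min(125, 2 * fls(num_online_cpus()) * (1 + fls(mem)));

so contention stays bounded as cpus are added, but the worst-case drift is
still threshold * num_online_cpus, which is the tension being discussed here.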


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-15 17:31                     ` Christoph Lameter
@ 2010-10-18  9:27                       ` KOSAKI Motohiro
  2010-10-18 15:44                         ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-18  9:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> On Thu, 14 Oct 2010, KOSAKI Motohiro wrote:
> 
> > Initial variable ZVC commit (df9ecaba3f1) says
> >
> > >     [PATCH] ZVC: Scale thresholds depending on the size of the system
> > >
> > >     The ZVC counter update threshold is currently set to a fixed value of 32.
> > >     This patch sets up the threshold depending on the number of processors and
> > >     the sizes of the zones in the system.
> > >
> > >     With the current threshold of 32, I was able to observe slight contention
> > >     when more than 130-140 processors concurrently updated the counters.  The
> > >     contention vanished when I either increased the threshold to 64 or used
> > >     Andrew's idea of overstepping the interval (see ZVC overstep patch).
> > >
> > >     However, we saw contention again at 220-230 processors.  So we need higher
> > >     values for larger systems.
> >
> > So, I'm worry about your patch reintroduce old cache contention issue that Christoph
> > observed when run 128-256cpus system.  May I ask how do you think this issue?
> 
> The load that I ran with was a test that concurrently faulted pages on a
> large number of processors. This is a bit artificial and is only of
> performance concern during startup of a large HPC job. The frequency of
> counter updates during regular processing should pose a much lighter load
> on the system. The automatic adaption of the thresholds should
> 
> 1. Preserve the initial startup performance (since the threshold will be
> unmodified on a system just starting).
> 
> 2. Reduce the overhead of establish a more accurate zone state (because
> reclaim can then cause the thresholds to be adapted).

One idea is to limit the number of drifting cpus, because

256 cpus:  125 x 256 x 4096  = 125MB  (ok, acceptable)
4096 cpus: 125 x 4096 x 4096 = 2GB    (argh, too big)

and I think we can assume 4096-cpu machines are used with cpusets,
virtualization or some other isolation mechanism.
Then, in practice, it never happens that all cpus access the same zone
at the same time.




---
 include/linux/mmzone.h |   16 ++---------
 include/linux/vmstat.h |    6 ++++
 mm/mmzone.c            |   20 --------------
 mm/page_alloc.c        |   10 ++++---
 mm/vmstat.c            |   67 +++++++++++++++++++++++++++++++++---------------
 5 files changed, 61 insertions(+), 58 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..699cdea 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -185,6 +185,7 @@ struct per_cpu_pageset {
 #ifdef CONFIG_SMP
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+	s8 vm_stat_drifted[NR_VM_ZONE_STAT_ITEMS];
 #endif
 };
 
@@ -286,13 +287,6 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
-	 * When free pages are below this point, additional steps are taken
-	 * when reading the number of free pages to avoid per-cpu counter
-	 * drift allowing watermarks to be breached
-	 */
-	unsigned long percpu_drift_mark;
-
-	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -355,6 +349,8 @@ struct zone {
 
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
+	atomic_t		drift_cpus_cur[NR_VM_ZONE_STAT_ITEMS];
+	int			drift_cpus_max;
 
 	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
@@ -458,12 +454,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1997988..37f0917 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -28,6 +28,12 @@
 
 #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx) , xx##_MOVABLE
 
+#ifdef ARCH_MAX_DRIFT_CPUS
+#define MAX_DRIFT_CPUS ARCH_MAX_DRIFT_CPUS
+#else
+#define MAX_DRIFT_CPUS 256
+#endif
+
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..54ce17b 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -88,23 +88,3 @@ int memmap_valid_within(unsigned long pfn,
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
 
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 222d8cc..f45aa19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1464,7 +1464,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
+	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2438,7 +2438,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
@@ -4895,10 +4895,12 @@ static void setup_per_zone_wmarks(void)
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+		u64 maxdrift = zone->drift_cpus_max * pcp->stat_threshold;
+		u64 tmp = pages_min + maxdrift;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		tmp = (u64)pages_min * zone->present_pages;
+		tmp *= zone->present_pages;
 		do_div(tmp, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b6b7fed..673ca22 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -140,25 +140,32 @@ void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
-		unsigned long max_drift, tolerate_drift;
-
+		zone->drift_cpus_max = min_t(int, num_online_cpus(),
+					     MAX_DRIFT_CPUS);
 		threshold = calculate_threshold(zone);
-
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+	}
+}
 
-		/*
-		 * Only set percpu_drift_mark if there is a danger that
-		 * NR_FREE_PAGES reports the low watermark is ok when in fact
-		 * the min watermark could be breached by an allocation
-		 */
-		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
-		max_drift = num_online_cpus() * threshold;
-		if (max_drift > tolerate_drift)
-			zone->percpu_drift_mark = high_wmark_pages(zone) +
-					max_drift;
+static bool vm_stat_drift_take(struct zone *zone, enum zone_stat_item item)
+{
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+	s8 *drifted = pcp->vm_stat_drifted + item;
+	atomic_t *cur = &zone->drift_cpus_cur[item];
+
+	if (likely(*drifted))
+		return true;
+
+	/* enter this slowpath only per sysctl_stat_interval. */
+
+	if (atomic_add_unless(cur, 1, zone->drift_cpus_max)) {
+		*drifted = 1;
+		return true;
 	}
+
+	return false;
 }
 
 /*
@@ -168,10 +175,14 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
 	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
-
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
+	if (unlikely(!vm_stat_drift_take(zone, item))) {
+		zone_page_state_add(delta, zone, item);
+		return;
+	}
+
 	x = delta + *p;
 
 	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
@@ -224,6 +235,11 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
+	if (unlikely(!vm_stat_drift_take(zone, item))) {
+		zone_page_state_add(1, zone, item);
+		return;
+	}
+
 	(*p)++;
 
 	if (unlikely(*p > pcp->stat_threshold)) {
@@ -245,6 +261,11 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
+	if (unlikely(!vm_stat_drift_take(zone, item))) {
+		zone_page_state_add(-1, zone, item);
+		return;
+	}
+
 	(*p)--;
 
 	if (unlikely(*p < - pcp->stat_threshold)) {
@@ -326,12 +347,16 @@ void refresh_cpu_vm_stats(int cpu)
 				unsigned long flags;
 				int v;
 
-				local_irq_save(flags);
-				v = p->vm_stat_diff[i];
-				p->vm_stat_diff[i] = 0;
-				local_irq_restore(flags);
-				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				if (p->vm_stat_drifted[i]) {
+					local_irq_save(flags);
+					v = p->vm_stat_diff[i];
+					p->vm_stat_diff[i] = 0;
+					p->vm_stat_drifted[i] = 0;
+					local_irq_restore(flags);
+					atomic_long_add(v, &zone->vm_stat[i]);
+					global_diff[i] += v;
+				}
+
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
 				p->expire = 3;
@@ -834,7 +859,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.6.5.2




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-14  3:07               ` KOSAKI Motohiro
@ 2010-10-18 10:39                 ` Mel Gorman
  2010-10-19  1:16                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-18 10:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Thu, Oct 14, 2010 at 12:07:29PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index c5dfabf..47ba29e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2378,7 +2378,9 @@ static int kswapd(void *p)
> > > >  				 */
> > > >  				if (!sleeping_prematurely(pgdat, order, remaining)) {
> > > >  					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > > > +					enable_pgdat_percpu_threshold(pgdat);
> > > >  					schedule();
> > > > +					disable_pgdat_percpu_threshold(pgdat);
> > > 
> > > If we have 4096 cpus, max drift = 125x4096x4096 ~= 2GB. It is higher than zone watermark.
> > > Then, such a system can exhaust its memory before kswapd calls disable_pgdat_percpu_threshold().
> > > 
> > 
> > I don't *think* so but lets explore that possibility. For this to occur, all
> > CPUs would have to be allocating all of their memory from the one node (4096
> > CPUs is not going to be UMA) which is not going to happen. But allocations
> > from one node could be falling over to others of course.
> > 
> > Lets take an early condition that has to occur for a 4096 CPU machine to
> > get into trouble - node 0 exhausted and moving to node 1 and counter drift
> > makes us think everything is fine.
> > 
> > __alloc_pages_nodemask
> >   -> get_page_from_freelist
> >     -> zone_watermark_ok == true (because we are drifting)
> >     -> buffered_rmqueue
> >       -> __rmqueue (fails eventually, no pages despite watermark_ok)
> >   -> __alloc_pages_slowpath
> >     -> wake_all_kswapd()
> > ...
> >
> > kswapd wakes
> >   -> disable_pgdat_percpu_threshold()
> > 
> > i.e. as each node becomes exhausted in reality, kswapd will wake up, disable
> > the thresholds until the high watermark is back and go back to sleep. I'm
> > not seeing how we'd get into a situation where all kswapds are asleep at the
> > same time while each allocator allocates all of memory without managing to
> > wake kswapd. Even GFP_ATOMIC allocations will wakeup kswapd.
> > 
> > Hence, I think the current patch of disabling thresholds while kswapd is
> > awake to be sufficient to avoid livelock due to memory exhaustion and
> > counter drift.
> > 
> 
> In this case, wakeup_kswapd() doesn't wake kswapd, because
> 
> ---------------------------------------------------------------------------------
> void wakeup_kswapd(struct zone *zone, int order)
> {
>         pg_data_t *pgdat;
> 
>         if (!populated_zone(zone))
>                 return;
> 
>         pgdat = zone->zone_pgdat;
>         if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
>                 return;                          // HERE
> ---------------------------------------------------------------------------------
> 
> So, if we take your approach, we need to know the exact number of free pages here.

Good point!

> But, zone_page_state_snapshot() is slow. That's the dilemma.
> 

Very true. I'm prototyping a version of the patch that keeps
zone_page_state_snapshot but only uses it in wakeup_kswapd and
sleeping_prematurely.
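
For reference, the reason the snapshot is expensive is that it has to walk
every online CPU's vm_stat_diff[] for the zone, which is one remote cache
line per CPU. Roughly (paraphrasing the helper from memory, details may
differ slightly):

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	/* Start from the (possibly stale) global counter... */
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* ...then fold in every CPU's not-yet-drained local delta */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}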

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot()
  2010-10-14  2:39                   ` KOSAKI Motohiro
@ 2010-10-18 10:43                     ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-18 10:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Thu, Oct 14, 2010 at 11:39:34AM +0900, KOSAKI Motohiro wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 53627fa..194bdaa 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4897,6 +4897,15 @@ static void setup_per_zone_wmarks(void)
> > >  	for_each_zone(zone) {
> > >  		u64 tmp;
> > >  
> > > +		/*
> > > +		 * If max drift are less than 1%, reserve max drift pages
> > > +		 * instead costly runtime calculation.
> > > +		 */
> > > +		if (zone->percpu_drift_mark < (zone->present_pages/100)) {
> > > +			pages_min += zone->percpu_drift_mark;
> > > +			zone->percpu_drift_mark = 0;
> > > +		}
> > > +
> > 
> > I don't see how this solves Shaohua's problem as such. Large systems will
> > still suffer a big performance penalty from zone_page_state_snapshot(). I
> > do see the logic of adjusting min for larger systems to limit the amount of
> > time per-cpu thresholds are lowered but that would be as a follow-on to my
> > patch rather than a replacement.
> 
> My patch rescues systems with 256 CPUs or fewer.

True, and it would be nice to limit how many machines any of this logic
applies to.

> and I assumed 4096-CPU systems don't
> run IO-intensive workloads such as Shaohua's case.

Also true, but they still suffer the drift problem. The reproduction
case would change but otherwise the drift must still be handled.

> they always use cpusets and run hpc
> workload.
> 
> If you know another >1024-CPU system, please let me know.
> And again, my patch works on 4096-CPU systems, although slowly, but yours doesn't.
> 
> Am I missing something?
> 

I think both are ultimately needed. For my patch, I need to make sure that
wakeup_kswapd() actually wakes up kswapd by using
zone_page_state_snapshot() when necessary. Your patch avoids the problem
differently but in a way that is nice for "smaller" machines.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-14  2:50                   ` KOSAKI Motohiro
  2010-10-15 17:31                     ` Christoph Lameter
@ 2010-10-18 11:08                     ` Mel Gorman
  2010-10-19  1:34                       ` KOSAKI Motohiro
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-18 11:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Thu, Oct 14, 2010 at 11:50:28AM +0900, KOSAKI Motohiro wrote:
> > On Wed, Oct 13, 2010 at 04:10:43PM +0900, KOSAKI Motohiro wrote:
> > > When memory shortage, we are using drain_pages() for flushing per cpu
> > > page cache. In this case, per cpu stat should be flushed too. because
> > > now we are under memory shortage and we need to know exact free pages.
> > > 
> > > Otherwise get_page_from_freelist() may fail even though pcp was flushed.
> > > 
> > 
> > With my patch adjusting the threshold to a small value while kswapd is awake,
> > it seems less necessary. 
> 
> I agree with this.
> 
> > It's also very hard to predict the performance of
> > this. We are certainly going to take a hit to do the flush but we *might*
> > gain slightly if an allocation succeeds because a watermark check passed
> > when the counters were updated. It's a definite hit for a possible gain
> > though which is not a great trade-off. Would need some performance testing.
> > 
> > I still think my patch on adjusting thresholds is our best proposal so
> > far on how to reduce Shaohua's performance problems while still being
> > safer from livelocks due to memory exhaustion.
> 
> OK, I will try to explain the detail of my worry.
> 
> Initial variable ZVC commit (df9ecaba3f1) says 
> 
> >     [PATCH] ZVC: Scale thresholds depending on the size of the system
> > 
> >     The ZVC counter update threshold is currently set to a fixed value of 32.
> >     This patch sets up the threshold depending on the number of processors and
> >     the sizes of the zones in the system.
> > 
> >     With the current threshold of 32, I was able to observe slight contention
> >     when more than 130-140 processors concurrently updated the counters.  The
> >     contention vanished when I either increased the threshold to 64 or used
> >     Andrew's idea of overstepping the interval (see ZVC overstep patch).
> > 
> >     However, we saw contention again at 220-230 processors.  So we need higher
> >     values for larger systems.
> 
> So, I'm worried that your patch reintroduces the old cache contention issue that Christoph
> observed when running 128-256 CPU systems.  May I ask what you think about this issue?
> 

It only reintroduces the overhead while kswapd is awake and the system is in danger
of accidentally allocating all of its pages. Yes, it's slower but it's
less risky.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-18  9:27                       ` KOSAKI Motohiro
@ 2010-10-18 15:44                         ` Christoph Lameter
  2010-10-19  1:10                           ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-10-18 15:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Mon, 18 Oct 2010, KOSAKI Motohiro wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..699cdea 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -185,6 +185,7 @@ struct per_cpu_pageset {
>  #ifdef CONFIG_SMP
>  	s8 stat_threshold;
>  	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
> +	s8 vm_stat_drifted[NR_VM_ZONE_STAT_ITEMS];
>  #endif
>  };

Significantly increases cache footprint for per_cpu_pagesets.

> @@ -168,10 +175,14 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
>  {
>  	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> -
>  	s8 *p = pcp->vm_stat_diff + item;
>  	long x;
>
> +	if (unlikely(!vm_stat_drift_take(zone, item))) {
> +		zone_page_state_add(delta, zone, item);
> +		return;
> +	}
> +
>  	x = delta + *p;
>
>  	if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) {
> @@ -224,6 +235,11 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
>
> +	if (unlikely(!vm_stat_drift_take(zone, item))) {
> +		zone_page_state_add(1, zone, item);
> +		return;
> +	}
> +
>  	(*p)++;
>
>  	if (unlikely(*p > pcp->stat_threshold)) {
> @@ -245,6 +261,11 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
>
> +	if (unlikely(!vm_stat_drift_take(zone, item))) {
> +		zone_page_state_add(-1, zone, item);
> +		return;
> +	}
> +
>  	(*p)--;
>
>  	if (unlikely(*p < - pcp->stat_threshold)) {

Increased overhead for basic VM counter management.

Instead of all of this why not simply set the stat_threshold to 0 for
select cpus?
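
A minimal sketch of that idea, just to illustrate the shape (the helper name
and the call site are made up, and exactly which CPUs to select is left open):

/*
 * Sketch only: clamp the stat threshold for the given CPUs so their
 * vm_stat_diff[] cannot accumulate meaningful drift for this zone.
 */
static void set_stat_threshold(struct zone *zone, const struct cpumask *cpus,
				s8 threshold)
{
	int cpu;

	for_each_cpu(cpu, cpus)
		per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;
}

/* e.g. set_stat_threshold(zone, cpumask_of_node(zone_to_nid(zone)), 0); */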


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-13  7:10               ` [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed KOSAKI Motohiro
  2010-10-13  7:16                 ` KAMEZAWA Hiroyuki
  2010-10-13 13:22                 ` Mel Gorman
@ 2010-10-18 15:51                 ` Christoph Lameter
  2010-10-19  0:43                   ` KOSAKI Motohiro
  2 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-10-18 15:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Shaohua Li, linux-mm, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Wed, 13 Oct 2010, KOSAKI Motohiro wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 194bdaa..8b50e52 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1093,6 +1093,7 @@ static void drain_pages(unsigned int cpu)
>  		pcp = &pset->pcp;
>  		free_pcppages_bulk(zone, pcp->count, pcp);
>  		pcp->count = 0;
> +		__flush_zone_state(zone, NR_FREE_PAGES);
>  		local_irq_restore(flags);
>  	}
>  }

drain_zone_pages() is called from refresh_vm_stats() and
refresh_vm_stats() already flushes the counters. The patch will not change
anything.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-18 15:51                 ` Christoph Lameter
@ 2010-10-19  0:43                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  0:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> On Wed, 13 Oct 2010, KOSAKI Motohiro wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 194bdaa..8b50e52 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1093,6 +1093,7 @@ static void drain_pages(unsigned int cpu)
> >  		pcp = &pset->pcp;
> >  		free_pcppages_bulk(zone, pcp->count, pcp);
> >  		pcp->count = 0;
> > +		__flush_zone_state(zone, NR_FREE_PAGES);
> >  		local_irq_restore(flags);
> >  	}
> >  }
> 
> drain_zone_pages() is called from refresh_vm_stats() and
> refresh_vm_stats() already flushes the counters. The patch will not change
> anything.

Well, it's drain_pages(), not drain_zone_pages(). drain_pages() is
called from the reclaim path.




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-18 15:44                         ` Christoph Lameter
@ 2010-10-19  1:10                           ` KOSAKI Motohiro
  0 siblings, 0 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  1:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> > @@ -245,6 +261,11 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
> >  	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> >  	s8 *p = pcp->vm_stat_diff + item;
> >
> > +	if (unlikely(!vm_stat_drift_take(zone, item))) {
> > +		zone_page_state_add(-1, zone, item);
> > +		return;
> > +	}
> > +
> >  	(*p)--;
> >
> >  	if (unlikely(*p < - pcp->stat_threshold)) {
> 
> Increased overhead for basic VM counter management.
> 
> Instead of all of this why not simply set the stat_threshold to 0 for
> select cpus?

Hmm... difficult. But I will think about this approach for a while ;)




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-18 10:39                 ` Mel Gorman
@ 2010-10-19  1:16                   ` KOSAKI Motohiro
  2010-10-19  9:08                     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  1:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> > In this case, wakeup_kswapd() don't wake kswapd because
> > 
> > ---------------------------------------------------------------------------------
> > void wakeup_kswapd(struct zone *zone, int order)
> > {
> >         pg_data_t *pgdat;
> > 
> >         if (!populated_zone(zone))
> >                 return;
> > 
> >         pgdat = zone->zone_pgdat;
> >         if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
> >                 return;                          // HERE
> > ---------------------------------------------------------------------------------
> > 
> > So, if we take your approach, we need to know exact free pages in this.
> 
> Good point!
> 
> > But, zone_page_state_snapshot() is slow. that's dilemma.
> > 
> 
> Very true. I'm prototyping a version of the patch that keeps
> zone_page_state_snapshot but only uses is in wakeup_kswapd and
> sleeping_prematurely.

OK, this might work. But note, if we are running an IO-intensive workload, wakeup_kswapd()
is called very frequently, because it is called even when the allocation succeeds. We need to
ask Shaohua to run and measure his problem workload. And can you please cc me
when you post the next version? I hope to review it too.

Thanks.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-18 11:08                     ` Mel Gorman
@ 2010-10-19  1:34                       ` KOSAKI Motohiro
  2010-10-19  9:06                         ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-19  1:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> > Initial variable ZVC commit (df9ecaba3f1) says 
> > 
> > >     [PATCH] ZVC: Scale thresholds depending on the size of the system
> > > 
> > >     The ZVC counter update threshold is currently set to a fixed value of 32.
> > >     This patch sets up the threshold depending on the number of processors and
> > >     the sizes of the zones in the system.
> > > 
> > >     With the current threshold of 32, I was able to observe slight contention
> > >     when more than 130-140 processors concurrently updated the counters.  The
> > >     contention vanished when I either increased the threshold to 64 or used
> > >     Andrew's idea of overstepping the interval (see ZVC overstep patch).
> > > 
> > >     However, we saw contention again at 220-230 processors.  So we need higher
> > >     values for larger systems.
> > 
> > So, I'm worry about your patch reintroduce old cache contention issue that Christoph
> > observed when run 128-256cpus system.  May I ask how do you think this issue?
> 
> It only reintroduces the overhead while kswapd is awake and the system is in danger
> of accidentally allocating all of its pages. Yes, it's slower but it's
> less risky.

When we have rich storage and are running an IO-intensive workload, kswapd is almost
always awake ;)
However, yes, your approach is less risky.




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed
  2010-10-19  1:34                       ` KOSAKI Motohiro
@ 2010-10-19  9:06                         ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-19  9:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Tue, Oct 19, 2010 at 10:34:13AM +0900, KOSAKI Motohiro wrote:
> > > Initial variable ZVC commit (df9ecaba3f1) says 
> > > 
> > > >     [PATCH] ZVC: Scale thresholds depending on the size of the system
> > > > 
> > > >     The ZVC counter update threshold is currently set to a fixed value of 32.
> > > >     This patch sets up the threshold depending on the number of processors and
> > > >     the sizes of the zones in the system.
> > > > 
> > > >     With the current threshold of 32, I was able to observe slight contention
> > > >     when more than 130-140 processors concurrently updated the counters.  The
> > > >     contention vanished when I either increased the threshold to 64 or used
> > > >     Andrew's idea of overstepping the interval (see ZVC overstep patch).
> > > > 
> > > >     However, we saw contention again at 220-230 processors.  So we need higher
> > > >     values for larger systems.
> > > 
> > > So, I'm worry about your patch reintroduce old cache contention issue that Christoph
> > > observed when run 128-256cpus system.  May I ask how do you think this issue?
> > 
> > It only reintroduces the overhead while kswapd is awake and the system is in danger
> > of accidentally allocating all of its pages. Yes, it's slower but it's
> > less risky.
> 
> When we have rich storage and are running an IO-intensive workload, kswapd is almost
> always awake ;)

That's an interesting assertion because it's not just the storage and IO
that is a factor but the number of pages it requires. For example, let's
assume a workload is write-intensive but it is using the same 30% of memory
for I/O.  That workload should not keep kswapd awake and if it is, it should
be investigated. Are you aware of a situation like this?

An I/O intensive workload that kswapd is constantly awake for must be
continually requiring new data, meaning it is either streaming writes or its
working set size exceeds physical memory. In the former case, there is not
much we can do because the workload is going to be blocked on I/O anyway to
write the pages. In the latter case, the machine could do with more memory
or the application could do with some tuning to reduce its footprint. In
either case, the workload is going to be more concerned with being blocked
on I/O than the increased cost of counters.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-19  1:16                   ` KOSAKI Motohiro
@ 2010-10-19  9:08                     ` Mel Gorman
  2010-10-22 14:12                       ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-19  9:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Tue, Oct 19, 2010 at 10:16:42AM +0900, KOSAKI Motohiro wrote:
> > > In this case, wakeup_kswapd() don't wake kswapd because
> > > 
> > > ---------------------------------------------------------------------------------
> > > void wakeup_kswapd(struct zone *zone, int order)
> > > {
> > >         pg_data_t *pgdat;
> > > 
> > >         if (!populated_zone(zone))
> > >                 return;
> > > 
> > >         pgdat = zone->zone_pgdat;
> > >         if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
> > >                 return;                          // HERE
> > > ---------------------------------------------------------------------------------
> > > 
> > > So, if we take your approach, we need to know exact free pages in this.
> > 
> > Good point!
> > 
> > > But, zone_page_state_snapshot() is slow. that's dilemma.
> > > 
> > 
> > Very true. I'm prototyping a version of the patch that keeps
> > zone_page_state_snapshot but only uses is in wakeup_kswapd and
> > sleeping_prematurely.
> 
> OK, this might work. But note, if we are running an IO-intensive workload, wakeup_kswapd()
> is called very frequently.

This is true. It is also necessary to alter wakeup_kswapd to minimise
the number of times it calls zone_watermark_ok_safe(). It'll need
careful review to be sure the new function is equivalent.
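
The reordering I have in mind is roughly this (untested sketch): do the
cheap bail-outs first so the expensive check only runs when kswapd is
actually asleep and might need waking.

void wakeup_kswapd(struct zone *zone, int order)
{
	pg_data_t *pgdat;

	if (!populated_zone(zone))
		return;
	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
		return;
	pgdat = zone->zone_pgdat;
	if (pgdat->kswapd_max_order < order)
		pgdat->kswapd_max_order = order;
	if (!waitqueue_active(&pgdat->kswapd_wait))
		return;
	/* Only now pay for the accurate watermark check */
	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
		return;

	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
	wake_up_interruptible(&pgdat->kswapd_wait);
}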

> because it is called even when the allocation succeeds. We need to
> ask Shaohua to run and measure his problem workload. And can you please cc me
> when you post the next version? I hope to review it too.
> 

Of course. I have the prototype ready but am waiting on tests at the
moment. Unfortunately the necessary infrastructure has been unavailable for
the last 18 hours to run the test but I'm hoping it gets fixed soon.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-19  9:08                     ` Mel Gorman
@ 2010-10-22 14:12                       ` Mel Gorman
  2010-10-22 15:23                         ` Christoph Lameter
                                           ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-22 14:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Tue, Oct 19, 2010 at 10:08:03AM +0100, Mel Gorman wrote:
> On Tue, Oct 19, 2010 at 10:16:42AM +0900, KOSAKI Motohiro wrote:
> > > > In this case, wakeup_kswapd() don't wake kswapd because
> > > > 
> > > > ---------------------------------------------------------------------------------
> > > > void wakeup_kswapd(struct zone *zone, int order)
> > > > {
> > > >         pg_data_t *pgdat;
> > > > 
> > > >         if (!populated_zone(zone))
> > > >                 return;
> > > > 
> > > >         pgdat = zone->zone_pgdat;
> > > >         if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
> > > >                 return;                          // HERE
> > > > ---------------------------------------------------------------------------------
> > > > 
> > > > So, if we take your approach, we need to know exact free pages in this.
> > > 
> > > Good point!
> > > 
> > > > But, zone_page_state_snapshot() is slow. that's dilemma.
> > > > 
> > > 
> > > Very true. I'm prototyping a version of the patch that keeps
> > > zone_page_state_snapshot but only uses is in wakeup_kswapd and
> > > sleeping_prematurely.
> > 
> > Ok, this might works. but note, if we are running IO intensive workload, wakeup_kswapd()
> > is called very frequently.
> 
> This is true. It is also necessary to alter wakeup_kswapd to minimise
> the number of times it calls zone_watermark_ok_safe(). It'll need
> careful review to be sure the new function is equivalent.
> 
> > because it is called even though allocation is succeed. we need to
> > request Shaohua run and mesure his problem workload. and can you please cc me
> > when you post next version? I hope to review it too.
> > 
> 
> Of course. I have the prototype ready but am waiting on tests at the
> moment. Unfortunately the necessary infrastructure has been unavailable for
> the last 18 hours to run the test but I'm hoping it gets fixed soon.
> 

The test machine finally became available again and the patch works as
expected. Here is the latest version. The main thing to look out for is
how wakeup_kswapd() is altered.

==== CUT HERE ====
mm: page allocator: Adjust the per-cpu counter threshold when memory is low

Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
memory is low] noted that watermarks were based on the vmstat
NR_FREE_PAGES. To avoid synchronization overhead, these counters are
maintained on a per-cpu basis and drained both periodically and when the
per-cpu delta exceeds a threshold. On large CPU systems, the difference
between the estimate and the real value of NR_FREE_PAGES can be very high.
The system can get into a state where pages are allocated far below the
min watermark, potentially causing livelock issues. The commit solved the
problem by taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can consume
a large amount of CPU time on systems with many sockets due to cache
line bouncing. This patch takes a different approach. For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift
to what should be a safe level. This incurs a performance penalty under heavy
memory pressure by a factor that depends on the workload and the machine, but
the machine should function correctly without accidentally exhausting all
memory on a node. There is an additional cost when kswapd wakes and sleeps
but the event is not expected to be frequent - in Shaohua's test case,
there was at least one recorded sleep and wake event.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd and when deciding whether it is really safe to go back
to sleep in sleeping_prematurely(). We are still using an expensive function
but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark functions
is reduced. The following report shows the percentage of time cumulatively
spent in the functions zone_nr_free_pages(), zone_watermark_ok(),
__zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(),
zone_page_state().

vanilla                     10.6834%
disable-threshold            0.9153%

Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   10 +++-------
 include/linux/vmstat.h |    2 ++
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |   35 +++++++++++++++++++++++++++--------
 mm/vmscan.c            |   15 +++++++++------
 mm/vmstat.c            |   46 +++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 86 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..8d789d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
@@ -651,7 +645,9 @@ typedef struct pglist_data {
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
 void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..c67d333 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
+void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8cfa9c..acffde3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return 1 if free pages are above 'mark'. This takes into account the order
+ * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags, long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
+	free_pages -= (1 << order) - 1;
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
+}
+
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+					zone_page_state(z, NR_FREE_PAGES));
+}
+
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	long free_pages = zone_page_state(z, NR_FREE_PAGES);
+
+	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
+		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
+
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+								free_pages);
 }
 
 #ifdef CONFIG_NUMA
@@ -2436,7 +2455,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..ba0c70a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		if (zone->all_unreclaimable)
 			continue;
 
-		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
 	}
@@ -2378,7 +2378,9 @@ static int kswapd(void *p)
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+					enable_pgdat_percpu_threshold(pgdat);
 					schedule();
+					disable_pgdat_percpu_threshold(pgdat);
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order)
 	if (!populated_zone(zone))
 		return;
 
-	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
+	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
+	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
+	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+		return;
+
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 355a9e6..ddee139 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone)) /
+				num_online_cpus()));
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void disable_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
+void enable_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +870,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-22 14:12                       ` Mel Gorman
@ 2010-10-22 15:23                         ` Christoph Lameter
  2010-10-22 18:45                           ` Mel Gorman
  2010-10-22 15:27                         ` Christoph Lameter
  2010-10-25  4:46                         ` KOSAKI Motohiro
  2 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-10-22 15:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

On Fri, 22 Oct 2010, Mel Gorman wrote:

> index eaaea37..c67d333 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
>  extern void __dec_zone_state(struct zone *, enum zone_stat_item);
>
>  void refresh_cpu_vm_stats(int);
> +void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
> +void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
>  #else /* CONFIG_SMP */

The naming is a bit misleading since disabling may only mean reducing the
threshold.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-22 14:12                       ` Mel Gorman
  2010-10-22 15:23                         ` Christoph Lameter
@ 2010-10-22 15:27                         ` Christoph Lameter
  2010-10-22 18:46                           ` Mel Gorman
  2010-10-25  4:46                         ` KOSAKI Motohiro
  2 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2010-10-22 15:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

On Fri, 22 Oct 2010, Mel Gorman wrote:

>
> +void disable_pgdat_percpu_threshold(pg_data_t *pgdat)

Call this set_pgdat_stat_threshold() and make it take a calculate_pressure
() function?

void set_pgdat_stat_threshold(pg_data_t *pgdat, int (*calculate_pressure)(struct zone *)) ?

Then  do

	set_pgdat_stat_threshold(pgdat, threshold_normal)

	set_pgdat_stat_threshold(pgdat, threshold_pressure)

?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-22 15:23                         ` Christoph Lameter
@ 2010-10-22 18:45                           ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-22 18:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

On Fri, Oct 22, 2010 at 10:23:34AM -0500, Christoph Lameter wrote:
> On Fri, 22 Oct 2010, Mel Gorman wrote:
> 
> > index eaaea37..c67d333 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
> >  extern void __dec_zone_state(struct zone *, enum zone_stat_item);
> >
> >  void refresh_cpu_vm_stats(int);
> > +void disable_pgdat_percpu_threshold(pg_data_t *pgdat);
> > +void enable_pgdat_percpu_threshold(pg_data_t *pgdat);
> >  #else /* CONFIG_SMP */
> 
> The naming is a bit misleading since disabling may only mean reducing the
> treshold.
> 

Suggestions? shrink, reduce?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-22 15:27                         ` Christoph Lameter
@ 2010-10-22 18:46                           ` Mel Gorman
  2010-10-22 20:01                             ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2010-10-22 18:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

On Fri, Oct 22, 2010 at 10:27:55AM -0500, Christoph Lameter wrote:
> On Fri, 22 Oct 2010, Mel Gorman wrote:
> 
> >
> > +void disable_pgdat_percpu_threshold(pg_data_t *pgdat)
> 
> Call this set_pgdat_stat_threshold() and make it take a calculate_pressure
> () function?
> 
> void set_pgdat_stat_threshold(pg_data_t *pgdat, int (*calculate_pressure)(struct zone *)) ?
> 
> Then  do
> 
> 	set_pgdat_stat_threshold(pgdat, threshold_normal)
> 
> 	set_pgdat_stat_threshold(pgdat, threshold_pressure)
> 
> ?
> 

I considered it but thought the indirection would look tortured and
hinder review. If we agree on the basic premise, I would do it as two
patches. Would that suit?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-22 18:46                           ` Mel Gorman
@ 2010-10-22 20:01                             ` Christoph Lameter
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2010-10-22 20:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Shaohua Li, linux-mm, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

On Fri, 22 Oct 2010, Mel Gorman wrote:

> I considered it but thought the indirection would look tortured and
> hinder review. If we agree on the basic premise, I would do it as two
> patches. Would that suit?

Sure. Aside from that small nitpick.

Reviewed-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-22 14:12                       ` Mel Gorman
  2010-10-22 15:23                         ` Christoph Lameter
  2010-10-22 15:27                         ` Christoph Lameter
@ 2010-10-25  4:46                         ` KOSAKI Motohiro
  2010-10-27  8:19                           ` Mel Gorman
  2 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2010-10-25  4:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Shaohua Li, linux-mm, cl, Andrew Morton,
	David Rientjes, KAMEZAWA Hiroyuki

> - * Return 1 if free pages are above 'mark'. This takes into account the order
> + * Return true if free pages are above 'mark'. This takes into account the order
>   * of the allocation.
>   */
> -int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> -		      int classzone_idx, int alloc_flags)
> +bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> +		      int classzone_idx, int alloc_flags, long free_pages)

static?


> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c5dfabf..ba0c70a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  		if (zone->all_unreclaimable)
>  			continue;
>  
> -		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> +		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
>  								0, 0))
>  			return 1;
>  	}

Do we need to change balance_pgdat() too?
Otherwise, balance_pgdat() returns immediately and can create a semi-infinite busy loop.


> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 355a9e6..ddee139 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
>  
>  #ifdef CONFIG_SMP
>  
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone)) /
> > +				num_online_cpus()));
> +}

On Shaohua's machine,

	CPU: 64
	MEM: 8GBx4 (=32GB)
	per-cpu vm-stat threshold: 98

	zone->min = sqrt(32x1024x1024x16)/4 = 5792 KB = 1448 pages
	zone->high - zone->low = zone->min/4 = 362 pages
	pressure-vm-threshold = 362/64 ~= 5

Hrm, this reduction seems rather dramatic (98->5).
Shaohua, can you please rerun your problem workload on your 64-CPU machine with
this patch applied?
Of course, if there is no performance regression, I'm not against this one.
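
For anyone re-checking the arithmetic above, a quick user-space sketch (not
kernel code; it assumes the usual defaults min_free_kbytes = sqrt(lowmem_kb * 16),
low = min + min/4, high = min + min/2, and an even split across the 4 nodes):

#include <math.h>
#include <stdio.h>

int main(void)
{
	long lowmem_kb = 32L * 1024 * 1024;		/* 32GB machine */
	long min_free_kb = (long)sqrt((double)lowmem_kb * 16);
	long node_min = min_free_kb / 4 / 4;		/* 4 nodes, 4KB pages */
	long high_minus_low = node_min / 4;		/* high - low = min/4 */

	printf("per-node zone->min : %ld pages\n", node_min);		/* ~1448 */
	printf("high - low         : %ld pages\n", high_minus_low);	/* ~362 */
	printf("pressure threshold : %ld\n", high_minus_low / 64);	/* ~5 */
	return 0;
}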

Thanks.





^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: zone state overhead
  2010-10-25  4:46                         ` KOSAKI Motohiro
@ 2010-10-27  8:19                           ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2010-10-27  8:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, cl, Andrew Morton, David Rientjes,
	KAMEZAWA Hiroyuki

On Mon, Oct 25, 2010 at 01:46:19PM +0900, KOSAKI Motohiro wrote:
> > - * Return 1 if free pages are above 'mark'. This takes into account the order
> > + * Return true if free pages are above 'mark'. This takes into account the order
> >   * of the allocation.
> >   */
> > -int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> > -		      int classzone_idx, int alloc_flags)
> > +bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> > +		      int classzone_idx, int alloc_flags, long free_pages)
> 
> static?
> 

Yes, it should be.

> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c5dfabf..ba0c70a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >  		if (zone->all_unreclaimable)
> >  			continue;
> >  
> > -		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> > +		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
> >  								0, 0))
> >  			return 1;
> >  	}
> 
> Do we need to change balance_pgdat() too?
> Otherwise, balance_pgdat() returns immediately and can create a semi-infinite busy loop.
> 

While balance_pgdat() is calling zone_watermark_ok(), the thresholds are
very low and the expected level of drift is minimal.  I considered the
semi-infinite busy loop to have a worst-case situation of 2 seconds until
the vmstat counters were synced and zone_watermark_ok* values matched.
There is a reasonable expectation that normal allocate/free activity would
sync the values for zone_watermark_ok* before that timeout.

To my surprise though, using zone_watermark_ok_safe() in balance_pgdat()
does not significantly increase the amount of time spent in the _safe()
function so it'll be called in the next version.

> 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 355a9e6..ddee139 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone)) /
> > +				num_online_cpus()));
> > +}
> 
> On Shaohua's machine,
> 
> 	CPU: 64
> 	MEM: 8GBx4 (=32GB)
> 	per-cpu vm-stat threshold: 98
> 
> 	zone->min = sqrt(32x1024x1024x16)/4 = 5792 KB = 1448 pages
> 	zone->high - zone->low = zone->min/4 = 362 pages
> 	pressure-vm-threshold = 362/64 ~= 5
> 
> Hrm, this reduction seems rather dramatic (98->5).

Yes, but consider the maximum possible drift:

	percpu-maximum-drift = 5*64 = 320

The value is massively reduced and the cost goes up but this is the value
necessary to avoid a situation where the high watermark is "ok" when in fact
the min watermark can be breached.

> Shaohua, can you please rerun your problem workload on your 64-CPU machine with
> this patch applied?
> Of course, if there is no performance regression, I'm not against this one.
> 

Your patches that adjusted min and high may allow this threshold to grow again.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2010-10-27  8:19 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-28  5:08 zone state overhead Shaohua Li
2010-09-28 12:39 ` Christoph Lameter
2010-09-28 13:30   ` Mel Gorman
2010-09-28 13:40     ` Christoph Lameter
2010-09-28 13:51       ` Mel Gorman
2010-09-28 14:08         ` Christoph Lameter
2010-09-29  3:02           ` Shaohua Li
2010-09-29  4:02     ` David Rientjes
2010-09-29  4:47       ` Shaohua Li
2010-09-29  5:06         ` David Rientjes
2010-09-29 10:03       ` Mel Gorman
2010-09-29 14:12         ` Christoph Lameter
2010-09-29 14:17           ` Mel Gorman
2010-09-29 14:34             ` Christoph Lameter
2010-09-29 14:41               ` Mel Gorman
2010-09-29 14:45                 ` Mel Gorman
2010-09-29 14:54                   ` Christoph Lameter
2010-09-29 14:52                 ` Christoph Lameter
2010-09-29 19:44         ` David Rientjes
2010-10-08 15:29 ` Mel Gorman
2010-10-09  0:58   ` Shaohua Li
2010-10-11  8:56     ` Mel Gorman
2010-10-12  1:05       ` Shaohua Li
2010-10-12 16:25         ` Mel Gorman
2010-10-13  2:41           ` Shaohua Li
2010-10-13 12:09             ` Mel Gorman
2010-10-13  3:36           ` KOSAKI Motohiro
2010-10-13  6:25             ` [RFC][PATCH 0/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
2010-10-13  6:27               ` [RFC][PATCH 1/3] mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occur KOSAKI Motohiro
2010-10-13  6:39                 ` KAMEZAWA Hiroyuki
2010-10-13 12:59                 ` Mel Gorman
2010-10-14  2:44                   ` KOSAKI Motohiro
2010-10-13  6:28               ` [RFC][PATCH 2/3] mm: update pcp->stat_threshold " KOSAKI Motohiro
2010-10-13  6:40                 ` KAMEZAWA Hiroyuki
2010-10-13 13:02                 ` Mel Gorman
2010-10-13  6:32               ` [RFC][PATCH 3/3] mm: reserve max drift pages at boot time instead using zone_page_state_snapshot() KOSAKI Motohiro
2010-10-13 13:19                 ` Mel Gorman
2010-10-14  2:39                   ` KOSAKI Motohiro
2010-10-18 10:43                     ` Mel Gorman
2010-10-13  7:10               ` [experimental][PATCH] mm,vmstat: per cpu stat flush too when per cpu page cache flushed KOSAKI Motohiro
2010-10-13  7:16                 ` KAMEZAWA Hiroyuki
2010-10-13 13:22                 ` Mel Gorman
2010-10-14  2:50                   ` KOSAKI Motohiro
2010-10-15 17:31                     ` Christoph Lameter
2010-10-18  9:27                       ` KOSAKI Motohiro
2010-10-18 15:44                         ` Christoph Lameter
2010-10-19  1:10                           ` KOSAKI Motohiro
2010-10-18 11:08                     ` Mel Gorman
2010-10-19  1:34                       ` KOSAKI Motohiro
2010-10-19  9:06                         ` Mel Gorman
2010-10-18 15:51                 ` Christoph Lameter
2010-10-19  0:43                   ` KOSAKI Motohiro
2010-10-13 11:24             ` zone state overhead Mel Gorman
2010-10-14  3:07               ` KOSAKI Motohiro
2010-10-18 10:39                 ` Mel Gorman
2010-10-19  1:16                   ` KOSAKI Motohiro
2010-10-19  9:08                     ` Mel Gorman
2010-10-22 14:12                       ` Mel Gorman
2010-10-22 15:23                         ` Christoph Lameter
2010-10-22 18:45                           ` Mel Gorman
2010-10-22 15:27                         ` Christoph Lameter
2010-10-22 18:46                           ` Mel Gorman
2010-10-22 20:01                             ` Christoph Lameter
2010-10-25  4:46                         ` KOSAKI Motohiro
2010-10-27  8:19                           ` Mel Gorman
