too big min_free

* too big min_free_kbytes
@ 2011-01-24  3:56 Shaohua Li
  2011-01-24 15:00 ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-01-24  3:56 UTC (permalink / raw)
  To: Andrew Morton, aarcange; +Cc: linux-mm, Chen, Tim C

Hi,
With transparent huge page, min_free_kbytes is set too big.
Before:
Node 0, zone    DMA32
  pages free     1812
        min      1424
        low      1780
        high     2136
        scanned  0
        spanned  519168
        present  511496

After:
Node 0, zone    DMA32
  pages free     482708
        min      11178
        low      13972
        high     16767
        scanned  0
        spanned  519168
        present  511496
This caused different performance problems in our test. I wonder why we
set the value so big.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: too big min_free_kbytes
  2011-01-24  3:56 too big min_free_kbytes Shaohua Li
@ 2011-01-24 15:00 ` Andrea Arcangeli
  2011-01-25 14:35   ` Mel Gorman
  2011-01-26 14:17   ` Mel Gorman
  0 siblings, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-24 15:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrew Morton, linux-mm, Chen, Tim C, Mel Gorman

On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> Hi,
> With transparent huge page, min_free_kbytes is set too big.
> Before:
> Node 0, zone    DMA32
>   pages free     1812
>         min      1424
>         low      1780
>         high     2136
>         scanned  0
>         spanned  519168
>         present  511496
> 
> After:
> Node 0, zone    DMA32
>   pages free     482708
>         min      11178
>         low      13972
>         high     16767
>         scanned  0
>         spanned  519168
>         present  511496
> This caused different performance problems in our test. I wonder why we
> set the value so big.

It's to enable Mel's anti-frag that keeps pageblocks with movable and
unmovable stuff separated, same as "hugeadm
--set-recommended-min_free_kbytes".

Now that I've checked, I'm seeing far too much free memory with only 4G
of RAM... You can see the difference with a "cp /dev/sda /dev/null" in the
background, interleaving these two commands:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo 1000 > /proc/sys/vm/min_free_kbytes

Setting min_free_kbytes to 67584 leaves 716MB of memory
free. Setting it to 1000 leaves 20MB free. I'm afraid losing 716MB on a
4G system is way excessive regardless of THP... can't we just have a
version of anti-frag that reserves a lot fewer pageblocks? Anti-frag
is quite important to keep slab from fragmenting everything. I don't think
we can leave it like this.

For now you can work around it with the above echo 1000 > ...


* Re: too big min_free_kbytes
  2011-01-24 15:00 ` Andrea Arcangeli
@ 2011-01-25 14:35   ` Mel Gorman
  2011-01-26 14:17   ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2011-01-25 14:35 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

Sorry for the long delay in replying. I've been out the last week and am
not properly back until tomorrow.

On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > Hi,
> > With transparent huge page, min_free_kbytes is set too big.
> > Before:
> > Node 0, zone    DMA32
> >   pages free     1812
> >         min      1424
> >         low      1780
> >         high     2136
> >         scanned  0
> >         spanned  519168
> >         present  511496
> > 
> > After:
> > Node 0, zone    DMA32
> >   pages free     482708
> >         min      11178
> >         low      13972
> >         high     16767
> >         scanned  0
> >         spanned  519168
> >         present  511496
> > This caused different performance problems in our test. I wonder why we
> > set the value so big.
> 
> It's to enable Mel's anti-frag that keeps pageblocks with movable and
> unmovable stuff separated, same as "hugeadm
> --set-recommended-min_free_kbytes".
> 

It's not so much "make it work" as "make it work better". The effect can
be measured by recording the mm_page_alloc_extfrag event. The more times
it occurs, the worse fragmentation can get. The event also reports
whether it is severe or not.

> Now that I checked, I'm seeing quite too much free memory with only 4G
> of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> background interleaving these two commands:
> 

There is more than just min_free_kbytes happening here. The high
watermark goes to 16M-ish but the amount of free memory is *way* above
that watermark. Something is causing page reclaim to be a lot more
aggressive than it should be.

Is there a difference with THP enabled and disabled but leaving
min_free_kbytes alone? My preliminary theory is that 2M pages are being
requested and kswapd is being woken up when it shouldn't
(__GFP_NO_KSWAPD not specified when it should be). Unfortunately I do
not have access to source at the moment to double check.

> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo 1000 > /proc/sys/vm/min_free_kbytes
> 
> The setting of min_free_kbytes to 67584 leads to 716MB of memory
> free. Setting to 1000 leads to 20MB free. I'm afraid losing 716MB on a
> 4G system is way excessive regardless of THP...

Agreed.

> can't we just have a
> version of anti-frag that reserves a lot fewers pageblocks?

Anti-frag doesn't really take any additional special action due to
min_free_kbytes and it shouldn't be clearing out pageblocks
aggressively like this. I think it would also be worth checking how
often the mm_vmscan_kswapd_wake and mm_vmscan_wakeup_kswapd trace events
are triggering. If mm_vmscan_wakeup_kswapd is triggering a lot, a stack
trace of the most common triggering event might give a clue as to what
is going wrong.

> Anti-frag
> is quite important to avoid slab to fragment everything. I don't think
> we can leave it like this.
> 
> For now you can workaround with the above echo 1000 > ...
> 

Agreed. I'll try to find time to investigate before the week is out but
after being offline for a week, I've a lot of catching up to do.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


* Re: too big min_free_kbytes
  2011-01-24 15:00 ` Andrea Arcangeli
  2011-01-25 14:35   ` Mel Gorman
@ 2011-01-26 14:17   ` Mel Gorman
  2011-01-26 15:23     ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 14:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > Hi,
> > With transparent huge page, min_free_kbytes is set too big.
> > Before:
> > Node 0, zone    DMA32
> >   pages free     1812
> >         min      1424
> >         low      1780
> >         high     2136
> >         scanned  0
> >         spanned  519168
> >         present  511496
> > 
> > After:
> > Node 0, zone    DMA32
> >   pages free     482708
> >         min      11178
> >         low      13972
> >         high     16767
> >         scanned  0
> >         spanned  519168
> >         present  511496
> > This caused different performance problems in our test. I wonder why we
> > set the value so big.
> 
> It's to enable Mel's anti-frag that keeps pageblocks with movable and
> unmovable stuff separated, same as "hugeadm
> --set-recommended-min_free_kbytes".
> 
> Now that I checked, I'm seeing quite too much free memory with only 4G
> of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> background interleaving these two commands:
> 

What kernel is this and is commit
[99504748: mm: kswapd: stop high-order balancing when any suitable zone
is balanced] present in the kernel you are testing?

I'm having very little luck reproducing your scenario with
2.6.38-rc2. min_free_kbytes is as expected and the free memory is close to
expectations when copying /dev/sda to /dev/null with or without transparent
hugepages.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


* Re: too big min_free_kbytes
  2011-01-26 14:17   ` Mel Gorman
@ 2011-01-26 15:23     ` Mel Gorman
  2011-01-26 15:42       ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 15:23 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 02:17:46PM +0000, Mel Gorman wrote:
> On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> > On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > > Hi,
> > > With transparent huge page, min_free_kbytes is set too big.
> > > Before:
> > > Node 0, zone    DMA32
> > >   pages free     1812
> > >         min      1424
> > >         low      1780
> > >         high     2136
> > >         scanned  0
> > >         spanned  519168
> > >         present  511496
> > > 
> > > After:
> > > Node 0, zone    DMA32
> > >   pages free     482708
> > >         min      11178
> > >         low      13972
> > >         high     16767
> > >         scanned  0
> > >         spanned  519168
> > >         present  511496
> > > This caused different performance problems in our test. I wonder why we
> > > set the value so big.
> > 
> > It's to enable Mel's anti-frag that keeps pageblocks with movable and
> > unmovable stuff separated, same as "hugeadm
> > --set-recommended-min_free_kbytes".
> > 
> > Now that I checked, I'm seeing quite too much free memory with only 4G
> > of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> > background interleaving these two commands:
> > 
> 
> What kernel is this and is commit
> [99504748: mm: kswapd: stop high-order balancing when any suitable zone
> is balanced] present in the kernel you are testing?
> 
> I'm having very little luck reproducing your scenario with
> 2.6.38-rc2.

Scratch that, a machine with 4G does reproduce it. The machine I was
trying was 2G. Will dig more.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


* Re: too big min_free_kbytes
  2011-01-26 15:23     ` Mel Gorman
@ 2011-01-26 15:42       ` Andrea Arcangeli
  2011-01-26 16:36         ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-26 15:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 03:23:02PM +0000, Mel Gorman wrote:
> On Wed, Jan 26, 2011 at 02:17:46PM +0000, Mel Gorman wrote:
> > On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> > > On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > > > Hi,
> > > > With transparent huge page, min_free_kbytes is set too big.
> > > > Before:
> > > > Node 0, zone    DMA32
> > > >   pages free     1812
> > > >         min      1424
> > > >         low      1780
> > > >         high     2136
> > > >         scanned  0
> > > >         spanned  519168
> > > >         present  511496
> > > > 
> > > > After:
> > > > Node 0, zone    DMA32
> > > >   pages free     482708
> > > >         min      11178
> > > >         low      13972
> > > >         high     16767
> > > >         scanned  0
> > > >         spanned  519168
> > > >         present  511496
> > > > This caused different performance problems in our test. I wonder why we
> > > > set the value so big.
> > > 
> > > It's to enable Mel's anti-frag that keeps pageblocks with movable and
> > > unmovable stuff separated, same as "hugeadm
> > > --set-recommended-min_free_kbytes".
> > > 
> > > Now that I checked, I'm seeing quite too much free memory with only 4G
> > > of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> > > background interleaving these two commands:
> > > 
> > 
> > What kernel is this and is commit
> > [99504748: mm: kswapd: stop high-order balancing when any suitable zone
> > is balanced] present in the kernel you are testing?
> > 
> > I'm having very little luck reproducing your scenario with
> > 2.6.38-rc2.
> 
> Scratch that, a machine with 4G does reproduce it. The machine I was
> trying was 2G. Will dig more.

I can't reproduce on a 16G system (there I never get more than a
hundred MB free even with cp in the background, which is perfectly fine
for 16G).

I can only reproduce on my 4G workstation, and it also happens after echo
never >enabled (so without THP). I was reproducing it with "cp" anyway,
which doesn't trigger THP allocations, but I verified to be sure. When
I start cp, kswapd isn't running yet, so free memory goes down to 170M;
then kswapd starts and frees 700M, and 700M then remains free
until I stop "cp". The high wmark is never set to more than
85M for the normal zone, which is not excessively horrible. I'd still
like to lower the wmark though!  (there are 2 pageblocks reserved in
the min watermark for each type, why not just 1? removing that *2
would already halve it, saving some 40M of RAM!). But the wmarks don't
seem to be the real offender; maybe it's something related to the tiny
pci32 zone that materializes on 4G systems that relocate a little memory
above 4G to make space for the pci32 mmio. I haven't finished debugging
it yet.

However, in the presence of memory pressure the low wmark is the limit, not
the high wmark (and when kswapd isn't running, free memory already goes
down to 170M even where I can reproduce). Maybe the failure with too
much memory free is only because of the wmark increasing from some
20M to ~100M, and maybe I'm seeing something unrelated to that
problem. I'd exclude __GFP_NO_KSWAPD as the cause, since it happens without
THP too and there's just one place where huge_memory.c allocates
memory.


* Re: too big min_free_kbytes
  2011-01-26 15:42       ` Andrea Arcangeli
@ 2011-01-26 16:36         ` Mel Gorman
  2011-01-26 17:42           ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 16:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 04:42:03PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 26, 2011 at 03:23:02PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 02:17:46PM +0000, Mel Gorman wrote:
> > > On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> > > > On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > > > > Hi,
> > > > > With transparent huge page, min_free_kbytes is set too big.
> > > > > Before:
> > > > > Node 0, zone    DMA32
> > > > >   pages free     1812
> > > > >         min      1424
> > > > >         low      1780
> > > > >         high     2136
> > > > >         scanned  0
> > > > >         spanned  519168
> > > > >         present  511496
> > > > > 
> > > > > After:
> > > > > Node 0, zone    DMA32
> > > > >   pages free     482708
> > > > >         min      11178
> > > > >         low      13972
> > > > >         high     16767
> > > > >         scanned  0
> > > > >         spanned  519168
> > > > >         present  511496
> > > > > This caused different performance problems in our test. I wonder why we
> > > > > set the value so big.
> > > > 
> > > > It's to enable Mel's anti-frag that keeps pageblocks with movable and
> > > > unmovable stuff separated, same as "hugeadm
> > > > --set-recommended-min_free_kbytes".
> > > > 
> > > > Now that I checked, I'm seeing quite too much free memory with only 4G
> > > > of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> > > > background interleaving these two commands:
> > > > 
> > > 
> > > What kernel is this and is commit
> > > [99504748: mm: kswapd: stop high-order balancing when any suitable zone
> > > is balanced] present in the kernel you are testing?
> > > 
> > > I'm having very little luck reproducing your scenario with
> > > 2.6.38-rc2.
> > 
> > Scratch that, a machine with 4G does reproduce it. The machine I was
> > trying was 2G. Will dig more.
> 
> I can't reproduce on a 16G system (there I never get more than an
> hundred mbyte free even with cp in background, which is very fine for
> 16G).
> 

It's a balancing problem in kswapd. From my preliminary examination
using ftrace, I determined that kswapd is never trying to go to sleep
and continually shrinking lists so it must be stuck in balance_pgdat().

> I only reproduce on my 4G workstation, and it happens also after echo
> never >enabled (so without THP). I was reproducing it with "cp" anyway
> which isn't triggering THP allocations but I verified to be sure. When
> I start cp kswapd wasn't running yet, so free levels go down to 170M,
> then kswapd starts and it frees 700M and then 700m remains free
> forever until I stop "cp".

This has nothing to do with THP. It should be possible to trigger on any
4G machine or specifically where the top zone is very small.

> The high wmark are never set to more than
> 85M for the normal zone, which is not excessively horrible. I'd still
> like to lower the wmark though!  (there are 2 pageblocks reserved in
> the min watermark for each type, why not just 1? removing that *2
> would already halve it saving some 40M of ram!).

This is a separate topic, let's not get side-tracked. Short answer, it
comes down to at the time when no pageblock of the appropriate
migratetype is free, we want on average one full pageblock to be free of
another type so it can be converted. This limits the amount of "mixing"
of pages of different migratetype in the same pageblock. The effect can
be monitored using the extfrag tracepoint.

> But the wmarks don't
> seem the real offender, maybe it's something related to the tiny pci32
> zone that materialize on 4g systems that relocate some little memory
> over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> it.
> 

This has to be it. What I think is happening is that we're in balance_pgdat(),
the "Normal" zone is never hitting the watermark and we constantly call
"goto loop_again" trying to "rebalance" all zones.

> However in presence of memory pressure the low wmark is the limit not
> the high wmark (and when kswapd isn't running free levels already go
> down to 170M even where I can reproduce). Maybe the failure with too
> much memory free may be only because of the increased wmark from some
> 20M to ~100M, and maybe I'm seeing something unrelated to that
> problem.

I very strongly suspect it's just because your Normal zone is never being
balanced and kswapd is never breaking out of balance_pgdat() as a result. I
hope to confirm before I get knocked back offline (my access to test machines
is currently heavily disrupted).

> __GFP_NO_KSWAPD I exclude is the issue as it happens without
> THP too and there's just one place where huge_memory.c allocates
> memory.

Agreed, it's nothing to do with __GFP_NO_KSWAPD from what I've seen so
far.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


* Re: too big min_free_kbytes
  2011-01-26 16:36         ` Mel Gorman
@ 2011-01-26 17:42           ` Mel Gorman
  2011-01-27 13:40             ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 17:42 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > But the wmarks don't
> > seem the real offender, maybe it's something related to the tiny pci32
> > zone that materialize on 4g systems that relocate some little memory
> > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > it.
> > 
> 
> This has to be it. What I think is happening is that we're in balance_pgdat(),
> the "Normal" zone is never hitting the watermark and we constantly call
> "goto loop_again" trying to "rebalance" all zones.
> 

Confirmed. The following "patch" should allow the number of free pages to
drop to a sensible level. Note, this is not intended as a fix because it's
the utterly wrong approach to take. It's only to illustrate where things
are going wrong when the top-most zone is very small.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..477cb77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2259,7 +2259,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		}
 
 		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							classzone_idx, 0))
+							classzone_idx, 0) &&
+				zone->present_pages >= pgdat->node_present_pages >> 2)
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2446,15 +2447,18 @@ loop_again:
 
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
-				all_zones_ok = 0;
-				/*
-				 * We are still under min water mark.  This
-				 * means that we have a GFP_ATOMIC allocation
-				 * failure risk. Hurry up!
-				 */
-				if (!zone_watermark_ok_safe(zone, order,
-					    min_wmark_pages(zone), end_zone, 0))
-					has_under_min_watermark_zone = 1;
+				if (zone->present_pages >= pgdat->node_present_pages >> 2) {
+					all_zones_ok = 0;
+
+					/*
+					 * We are still under min water mark.  This
+					 * means that we have a GFP_ATOMIC allocation
+					 * failure risk. Hurry up!
+					 */
+					if (!zone_watermark_ok_safe(zone, order,
+						    min_wmark_pages(zone), end_zone, 0))
+						has_under_min_watermark_zone = 1;
+				}
 			} else {
 				/*
 				 * If a zone reaches its high watermark,


* Re: too big min_free_kbytes
  2011-01-26 17:42           ` Mel Gorman
@ 2011-01-27 13:40             ` Mel Gorman
  2011-01-27 15:27               ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-27 13:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > But the wmarks don't
> > > seem the real offender, maybe it's something related to the tiny pci32
> > > zone that materialize on 4g systems that relocate some little memory
> > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > it.
> > > 
> > 
> > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > the "Normal" zone is never hitting the watermark and we constantly call
> > "goto loop_again" trying to "rebalance" all zones.
> > 
> 
> Confirmed.
> <SNIP>

How about the following? Functionally it would work but I am concerned
that the logic in balance_pgdat() and kswapd() is getting out of hand,
having been adjusted to work with a number of corner cases already. In
the next cycle, it could do with a "do-over" attempt to make it easier
to follow.

==== CUT HERE ====
mm: kswapd: Do not reclaim excessive pages from already balanced zones

When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on all
zones if necessary and applies equal pressure on the inactive zone unless
a lot of pages are free already.

A "lot of free pages" is defined as 8*high_watermark which historically has
been reasonably fine as min_free_kbytes was small. However, on systems using
huge pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes. With the introduction of
transparent huge page support, this recommended value is also applied. The
problem then is in the corner cases.

On X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
expect around 68M of memory to be free. The Normal zone is approximately
35000 pages so under even normal memory pressure such as copying a large
file, it gets exhausted quickly. As it is getting exhausted, kswapd
applies pressure equally to all zones, including the DMA32 zone. DMA32 is
approximately 700,000 pages with a high watermark of around 23,000 pages. In
this situation, kswapd will reclaim around (23000*8) pages or 718M of pages
before the zone is ignored. What the user sees is kswapd constantly stuck
in D state and free memory far higher than it should be.

This patch addresses the problem by taking into account if kswapd is looping
in balance_pgdat() when deciding if a zone is balanced or not.  If the zone
is relatively small and kswapd is looping or preparing to sleep, then the
zone is considered balanced. If an allocator has hit the low watermark,
kswapd will stay awake (pgdat->kswapd_max_order or classzone_idx will be
set and reread) or will get woken later when real memory pressure exists.

Using a very basic test of cp /dev/sda6 /dev/null where sda6 was an 80G
partition, the amount of free memory without this patch hovered around
the 700M mark and around the 90M mark when applied which is closer to
expectations for the larger default min_free_kbytes with THP enabled.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   44 ++++++++++++++++++++++++++++++++++++++------
 1 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..3d4ffd3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2228,6 +2228,35 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }

+static bool zone_balanced(struct zone *zone, int order, unsigned long mark,
+				int classzone_idx, bool firstscan)
+{
+	pg_data_t *pgdat = zone->zone_pgdat;
+
+	/*
+	 * If this is a relatively small zone and kswapd is looping
+	 * for order-0 pages, consider the zone to be balanced so
+	 * kswapd has a chance to go back to sleep. Direct reclaimers
+	 * will wake kswapd again if necessary. Otherwise there is a
+	 * risk that kswapd will reclaim an excessive number of pages
+	 * from larger zones even when allocators do not require it
+	 * due to balance_pgdat reclaiming pages from each zone unless
+	 * free pages > 8*high_watermark which is potentially a large
+	 * number of pages. 
+	 *
+	 * Small is considered to be node_present_pages >> 2 due to
+	 * the "free pages > 8*high_watermark" heuristic. The 
+	 * smallest possible low zone (DMA) and a small high zone
+	 * should in combination be related to the maximum amount
+	 * of memory kswapd will reclaim from the other zones.
+	 */
+	if (!firstscan && order == 0 &&
+			zone->present_pages < pgdat->node_present_pages >> 2)
+		return true;
+
+	return zone_watermark_ok_safe(zone, order, mark, classzone_idx, 0);
+}
+
 /* is kswapd sleeping prematurely? */
 static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
@@ -2258,8 +2287,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 			continue;
 		}

-		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							classzone_idx, 0))
+		if (!zone_balanced(zone, order, high_wmark_pages(zone),
+							classzone_idx, false))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2306,6 +2335,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long total_scanned;
+	bool firstscan = true;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -2444,16 +2474,16 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;

-			if (!zone_watermark_ok_safe(zone, order,
-					high_wmark_pages(zone), end_zone, 0)) {
+			if (!zone_balanced(zone, order,
+					high_wmark_pages(zone), end_zone, firstscan)) {
 				all_zones_ok = 0;
 				/*
 				 * We are still under min water mark.  This
 				 * means that we have a GFP_ATOMIC allocation
 				 * failure risk. Hurry up!
 				 */
-				if (!zone_watermark_ok_safe(zone, order,
-					    min_wmark_pages(zone), end_zone, 0))
+				if (!zone_balanced(zone, order,
+					    min_wmark_pages(zone), end_zone, firstscan))
 					has_under_min_watermark_zone = 1;
 			} else {
 				/*
@@ -2520,6 +2550,8 @@ out:
 		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
 			order = sc.order = 0;

+		firstscan = false;
+
 		goto loop_again;
 	}


* Re: too big min_free_kbytes
  2011-01-27 13:40             ` Mel Gorman
@ 2011-01-27 15:27               ` Andrea Arcangeli
  2011-01-27 16:03                 ` Mel Gorman
                                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-27 15:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 01:40:58PM +0000, Mel Gorman wrote:
> On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > > But the wmarks don't
> > > > seem the real offender, maybe it's something related to the tiny pci32
> > > > zone that materialize on 4g systems that relocate some little memory
> > > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > > it.
> > > > 
> > > 
> > > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > > the "Normal" zone is never hitting the watermark and we constantly call
> > > "goto loop_again" trying to "rebalance" all zones.
> > > 
> > 
> > Confirmed.
> > <SNIP>
> 
> How about the following? Functionally it would work but I am concerned
> that the logic in balance_pgdat() and kswapd() is getting out of hand
> having been adjusted to work with a number of corner cases already. In
> the next cycle, it could do with a "do-over" attempt to make it easier
> to follow.

That number 8 is the problem. I don't think anybody was ever supposed
to free 8*highwmark pages. kswapd must work in the low->high hysteresis
range and then sleep until the low wmark is hit again before it gets
woken up. Not sure how that number 8 ever came up... but to me it looks
like the real offender and I wouldn't work around it.
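
[Editorial illustration, not part of the original message.] A minimal
sketch of why the factor of 8 matters, assuming 4 KiB pages and a high
wmark of about 88M as described in the patch below. The helpers
mib_to_pages() and below_mark() are hypothetical names; below_mark() is
only a simplified stand-in for the !zone_watermark_ok_safe() test and
ignores the allocation order and the lowmem reserves.

```c
#include <stdbool.h>

/* Assumption: 4 KiB pages, so 1 MiB is 256 pages. */
#define PAGES_PER_MIB 256UL

/* Hypothetical helper: convert MiB to a page count. */
unsigned long mib_to_pages(unsigned long mib)
{
	return mib * PAGES_PER_MIB;
}

/*
 * Simplified stand-in for !zone_watermark_ok_safe(): the zone is
 * treated as still needing reclaim while free pages do not exceed
 * the mark. Order and lowmem reserves are deliberately ignored.
 */
bool below_mark(unsigned long free_pages, unsigned long mark)
{
	return free_pages <= mark;
}
```

Under these assumptions a zone with 400M free still fails the
8*high_wmark test (the mark is 704M) and keeps being shrunk, while the
same zone passes a test against the plain high wmark.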

totally untested... I will test....

====
Subject: vmscan: kswapd must not free more than high_wmark pages

From: Andrea Arcangeli <aarcange@redhat.com>

When min_free_kbytes is set with "hugeadm
--set-recommended-min_free_kbytes" or with THP enabled (which runs the
equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
anti-frag at full effectiveness automatically at boot), the high wmark
of some zone is as high as ~88M. 88M free on a 4G system isn't
horrible, but 88M*8 = 704M free on a 4G system is definitely
unbearable. This only tends to be visible on 4G systems with a tiny
over-4g zone, where kswapd insists on reaching the high wmark on the
over-4g zone but in doing so shrinks up to 704M from the normal zone by
mistake.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---


diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..9e3c78e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2407,7 +2407,7 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 15:27               ` Andrea Arcangeli
@ 2011-01-27 16:03                 ` Mel Gorman
  2011-01-27 18:52                   ` Andrea Arcangeli
  2011-02-03  2:58                 ` Andrea Arcangeli
  2011-02-12  9:48                 ` alex shi
  2 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-27 16:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shaohua Li, Andrew Morton, Rik van Riel, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 01:40:58PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> > > On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > > > But the wmarks don't
> > > > > seem the real offender, maybe it's something related to the tiny pci32
> > > > > zone that materialize on 4g systems that relocate some little memory
> > > > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > > > it.
> > > > > 
> > > > 
> > > > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > > > the "Normal" zone is never hitting the watermark and we constantly call
> > > > "goto loop_again" trying to "rebalance" all zones.
> > > > 
> > > 
> > > Confirmed.
> > > <SNIP>
> > 
> > How about the following? Functionally it would work but I am concerned
> > that the logic in balance_pgdat() and kswapd() is getting out of hand
> > having been adjusted to work with a number of corner cases already. In
> > the next cycle, it could do with a "do-over" attempt to make it easier
> > to follow.
> 
> That number 8 is the problem,

Agreed, I considered your approach as well. I didn't go with it because it
was the main heuristic that allowed kswapd to skip a zone but still allows
kswapd to keep going. I made the choice to try and put kswapd to sleep
sooner.

> I don't think anybody was ever supposed
> to free 8*highwmark pages. kswapd must work in the low->high hysteresis
> range and then sleep until the low wmark is hit again before it gets
> woken up. Not sure how that number 8 ever came up... but to me it looks
> like the real offender and I wouldn't work around it.
> 

It was introduced by commit [32a4330d: mm: prevent kswapd from freeing
excessive amounts of lowmem] and sure enough, it was intended to avoid a
situation where memory was freed from every zone if one was imbalanced -
sounds familiar.

> totally untested... I will test....
> 

It should work in terms of free memory. When testing, monitor as well if
kswapd is going asleep or if it is stuck in D state. If it's stuck in D state,
it's looping around in balance_pgdat() and consuming CPU for no good reason
(can use vmscan tracepoints to confirm).
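
[Editorial illustration, not part of the original message.] One way to
watch the kswapd state during such a test is to read the state field of
/proc/<pid>/stat for the kswapd thread. The sketch below only parses a
single stat line; proc_stat_state() is a hypothetical helper name, and
finding the kswapd pid and reading the file are left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the process state character from one /proc/<pid>/stat line,
 * or '\0' if the line cannot be parsed. The comm field is wrapped in
 * parentheses and may itself contain ')' or spaces, so search for the
 * last ')' instead of splitting on whitespace.
 */
char proc_stat_state(const char *line)
{
	const char *close = strrchr(line, ')');

	if (close == NULL || close[1] != ' ' || close[2] == '\0')
		return '\0';
	return close[2];
}
```

Sampling this periodically while the workload runs would show whether
the thread spends its time in S (sleeping), R (running) or D
(uninterruptible sleep).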

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 16:03                 ` Mel Gorman
@ 2011-01-27 18:52                   ` Andrea Arcangeli
  2011-01-27 20:33                     ` Rik van Riel
  2011-01-27 21:31                     ` Mel Gorman
  0 siblings, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-27 18:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, Rik van Riel, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 04:03:01PM +0000, Mel Gorman wrote:
> Agreed, I considered your approach as well. I didn't go with it because it
> was the main heuristic that allowed kswapd to skip a zone but still allows
> kswapd to keep going. I made the choice to try and put kswapd to sleep
> sooner.

Ok, but a multiplication *8 remains excessive and while it may be ok
with min_free_kbytes=20M it's not ok when it's = 80M, especially when
it can be set to 80M on a 4G system that will end up with a small
over-4g zone that may not be shrunk as easily as the normal/pci32 zone
below 4g.

It's broken because all this *8 does is add a 7*highwmark "gap".

I'm having a little trouble understanding your patch and I don't like
the magic >> 2 very much, if the node has little more than 1/4th of
the memory of the node, it'll still cause the other zones to be shrunk
8 times more than they should ever be shrunk! This will materialize
with ~mem=5g , with your patch a little more than 5g will still lead
to ~800M free by mistake. It seems more a band aid for the 4g case
than a real fix. This is why I think the real fix is to remove that *8
and create a real "balance gap ratio" that is in function of the
memory of the zone, not in function of the high wmark at all.

If we were using the old code the gap would be way smaller. The "gap"
is increasing excessively because the "high wmark" is increasing to a
fixed value in function of the pageblocks numbers, the migrate types
etc..., but from an algorithm point of view the high wmark has no
effect on the rotation of all lrus to balance the shrinking of all
zones. The high wmark is a fixed amount for all zones, the "gap"
doesn't need to increase with the high wmark.

Clearly the high wmark was used because in the old days it was a function
of the ram size; now it isn't anymore. So clearly the "gap" must not
be a function of the high wmark anymore but only a function of the
memory size! Which I think is the real fix.

> It was introduced by commit [32a4330d: mm: prevent kswapd from freeing
> excessive amounts of lowmem] and sure enough, it was intended to avoid a
> situation where memory was freed from every zone if one was imbalanced -
> sounds familiar.

Yes definitely. So it was limiting the waste to 8*high_wmark. But that
was ok because it had the assumption that the wmark was a function of
memory; it's not ok anymore and we must make it a function of memory
explicitly to fix this.

> It should work in terms of free memory. When testing, monitor as well if
> kswapd is going asleep or if it is stuck in D state. If it's stuck in D state,
> it's looping around in balance_pgdat() and consuming CPU for no good reason
> (can use vmscan tracepoints to confirm).

I'll try another patch first to avoid disabling the balancing of all
zones that should provide for a nicer lru behavior than my previous
patch.

I am however uncertain this is really better than removing the *8 as
in my previous patch. But either this or previous patch I sent is the
solution I prefer, because this fixes it without a magic >>2 that will
break again quite badly at little more than mem=5g.

====
Subject: vmscan: kswapd must not free more than high_wmark+gap pages

From: Andrea Arcangeli <aarcange@redhat.com>

When min_free_kbytes is set with "hugeadm
--set-recommended-min_free_kbytes" or with THP enabled (which runs the
equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
anti-frag at full effectiveness automatically at boot), the high wmark
of some zone is fixed as high as ~88M and is no longer a function of
memory size. 88M free on a 4G system isn't horrible, but 88M*8 = 704M
free on a 4G system is unbearable. This only tends to be visible on 4G
systems with a tiny over-4g zone, where kswapd insists on reaching the
high wmark on the over-4g zone but in doing so shrinks up to 704M from
the normal zone by mistake. This patch makes the "gap" an explicit
function of memory size, because the high wmark isn't a function of
memory size anymore.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..a57c6e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -155,6 +155,15 @@ enum {
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX

+/*
+ * Ratio between the present memory in the zone and the "gap" that
+ * we're allowing kswapd to shrink in addition to the per-zone high
+ * wmark, even for zones that already have the high wmark satisfied,
+ * in order to provide better per-zone lru behavior. We are ok to
+ * spend not more than 1% of the memory for this zone balancing "gap".
+ */
+#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
+
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..f03441e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2403,11 +2403,16 @@ loop_again:
 			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);

 			/*
-			 * We put equal pressure on every zone, unless one
-			 * zone has way too many pages free already.
+			 * We put equal pressure on every zone, unless
+			 * one zone has way too many pages free
+			 * already. The "too many pages" is defined
+			 * as the high wmark plus a "gap".
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					(zone->present_pages +
+					 KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+					 KSWAPD_ZONE_BALANCE_GAP_RATIO +
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 18:52                   ` Andrea Arcangeli
@ 2011-01-27 20:33                     ` Rik van Riel
  2011-01-27 21:31                     ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-27 20:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/27/2011 01:52 PM, Andrea Arcangeli wrote:

>   			if (!zone_watermark_ok_safe(zone, order,
> -					8*high_wmark_pages(zone), end_zone, 0))
> +					(zone->present_pages +
> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO +
> +					high_wmark_pages(zone), end_zone, 0))
>   				shrink_zone(priority, zone,&sc);

Isn't (zone->present_pages + 99) / 100 + high_wmark_pages(zone)
pretty much guaranteed to be significantly larger than the 8
times the high watermark we had before?

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 18:52                   ` Andrea Arcangeli
  2011-01-27 20:33                     ` Rik van Riel
@ 2011-01-27 21:31                     ` Mel Gorman
  2011-01-27 23:18                       ` Rik van Riel
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-27 21:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shaohua Li, Andrew Morton, Rik van Riel, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 07:52:15PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 04:03:01PM +0000, Mel Gorman wrote:
> > Agreed, I considered your approach as well. I didn't go with it because it
> > was the main heuristic that allowed kswapd to skip a zone but still allows
> > kswapd to keep going. I made the choice to try and put kswapd to sleep
> > sooner.
> 
> Ok, but a multiplication *8 remains excessive and while it may be ok
> with min_free_kbytes=20M it's not ok when it's = 80M, especially when
> it can be set to 80M on a 4G system that will end up with a small
> over-4g zone that may not be shrunk as easily as the normal/pci32 zone
> below 4g.
> 

Agreed on this front at least.

> It's broken because this *8 adds is all about a 7*highwmark "gap".
> 

The gap as a multiple is not so much the issue as how much of a gap that
works out as being.

> I'm having a little trouble understanding your patch and I don't like
> the magic >> 2 very much, if the node has little more than 1/4th of

if the zone has little more than 1/4th I assume you mean.

> the memory of the node, it'll still cause the other zones to be shrunk
> 8 times more than they should ever be shrunk! This will materialize
> with ~mem=5g , with your patch a little more than 5g will still lead
> to ~800M free by mistake.

You're right that 5G would lead to the Normal zone being slightly above the
quarter mark. Initially I considered that a 1G zone would remain
balanced for long enough for kswapd to go to sleep but now that I
consider it more it's not safe. It might work on one machine and fail on
a faster one, making it hard to pin down.

> It seems more a band aid for the 4g case
> than a real fix. This is why I think the real fix is to remove that *8
> and create a real "balance gap ratio" that is in function of the
> memory of the zone, not in function of the high wmark at all.
> 
> If we were using the old code the gap would be way smaller. The "gap"
> is increasing excessively because the "high wmark" is increasing to a
> fixed value in function of the pageblocks numbers, the migrate types
> etc..., but from an algorithm point of view the high wmark has no
> effect on the rotation of all lrus to balance the shrinking of all
> zones. The high wmark is a fixed amount for all zones, the "gap"
> doesn't need to increase with the high wmark.
> 

Ok, that would be a mild improvement but what value should that gap be?
If it's a plain percentage of the zone, it could still become an
extremely large value. Conceivably it would be better to rely on an
event from the page allocator. Specifically, if the allocator has not
complained that this node is under pressure recently as indicated from
calls to wakeup_kswapd() then stop reclaiming from any zone that meets
the watermark.

> Clearly the high wmark was used because in the old days it was a function
> of the ram size; now it isn't anymore. So clearly the "gap" must not
> be a function of the high wmark anymore but only a function of the
> memory size! Which I think is the real fix.
> 
> > It was introduced by commit [32a4330d: mm: prevent kswapd from freeing
> > excessive amounts of lowmem] and sure enough, it was intended to avoid a
> > situation where memory was freed from every zone if one was imbalanced -
> > sounds familiar.
> 
> Yes definitely. So it was limiting the waste to 8*high_wmark. But that
> was ok because it had the assumption that the wmark was a function of
> memory; it's not ok anymore and we must make it a function of memory
> explicitly to fix this.
> 

hmm, admittedly a gap that was a function of memory would limit the damage
but it doesn't prevent a situation where a really small Normal zone can
prevent kswapd going to sleep. i.e. when I get to testing your patch
(hopefully tomorrow, tuesday at worst), I'll be looking for kswapd being
stuck in D state.

> > It should work in terms of free memory. When testing, monitor as well if
> > kswapd is going asleep or if it is stuck in D state. If it's stuck in D state,
> > it's looping around in balance_pgdat() and consuming CPU for no good reason
> > (can use vmscan tracepoints to confirm).
> 
> I'll try another patch first to avoid disabling the balancing of all
> zones that should provide for a nicer lru behavior than my previous
> patch.
> 
> I am however uncertain this is really better than removing the *8 as
> in my previous patch. But either this or previous patch I sent is the
> solution I prefer, because this fixes it without a magic >>2 that will
> break again quite badly at little more than mem=5g.
> 

Whatever the final solution, it both needs to prevent too much memory
being reclaimed and allow kswapd to go to sleep if there is no
indication from the page allocator that it should stay awake.

> ====
> Subject: vmscan: kswapd must not free more than high_wmark+gap pages
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> When the min_free_kbytes is set with `hugeadm
> --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> anti-frag at full effectiveness automatically at boot) the high wmark
> of some zone is fixed as high as ~88M, not anymore in function of
> memory size. 88M free on a 4G system isn't horrible, but 88M*8 = 704M
> free on a 4G system is unbearable. This only tends to be visible on 4G

At the very least, we agree on what is causing this problem :)

> systems with tiny over-4g zone where kswapd insists to reach the high
> wmark on the over-4g zone but doing so it shrunk up to 704M from the
> normal zone by mistake. This patch makes the "gap" explicit in
> function of memory size, because the high wmark isn't in function of
> memory size anymore.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4d55932..a57c6e7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -155,6 +155,15 @@ enum {
>  #define SWAP_CLUSTER_MAX 32
>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>  
> +/*
> + * Ratio between the present memory in the zone and the "gap" that
> + * we're allowing kswapd to shrink in addition to the per-zone high
> + * wmark, even for zones that already have the high wmark satisfied,
> + * in order to provide better per-zone lru behavior. We are ok to
> + * spend not more than 1% of the memory for this zone balancing "gap".
> + */
> +#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
> +
>  #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
>  #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
>  #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5d90de..f03441e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2403,11 +2403,16 @@ loop_again:
>  			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
>  
>  			/*
> -			 * We put equal pressure on every zone, unless one
> -			 * zone has way too many pages free already.
> +			 * We put equal pressure on every zone, unless
> +			 * one zone has way too many pages free
> +			 * already. The "too many pages" is defined
> +			 * as the high wmark plus a "gap".
>  			 */
>  			if (!zone_watermark_ok_safe(zone, order,
> -					8*high_wmark_pages(zone), end_zone, 0))
> +					(zone->present_pages +
> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO +
> +					high_wmark_pages(zone), end_zone, 0))

Rik has already pointed out that this potentially is a very large gap
but that is an addressable problem if the final decision goes this
direction.

>  				shrink_zone(priority, zone, &sc);
>  			reclaim_state->reclaimed_slab = 0;
>  			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
> 

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 21:31                     ` Mel Gorman
@ 2011-01-27 23:18                       ` Rik van Riel
  2011-01-28 10:35                         ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-27 23:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/27/2011 04:31 PM, Mel Gorman wrote:

> Whatever the final solution, it both needs to prevent too much memory
> being reclaimed and allow kswapd to go to sleep if there is no
> indication from the page allocator that it should stay awake.

A third requirement:

If one zone has a lot lower memory pressure than another zone,
we want to do relatively more memory allocations from that zone,
than from a zone where the memory is heavily used.

If kswapd only ever goes up to the high watermark, and also uses
that as its sleep point, the allocations end up corresponding to
zone size alone and not to memory pressure.

Going a little bit above the high watermark (1% of zone size?
high + min watermark?) will help balance things out between zones.

>>   			if (!zone_watermark_ok_safe(zone, order,
>> -					8*high_wmark_pages(zone), end_zone, 0))
>> +					(zone->present_pages +
>> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO +
>> +					high_wmark_pages(zone), end_zone, 0))
>
> Rik has already pointed out that this potentially is a very large gap
> but that is an addressable problem if the final decision goes this
> direction.

I was wrong.  I guess on some systems the min watermark can be less
than 1% and (high + min) may be better, but on most systems the
number of pages should be about the same.

Maybe we should use high_wmark_pages(zone) + low_wmark_pages(zone)
for easy readability?
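
[Editorial illustration, not part of the original message.] The
candidate thresholds discussed so far can be compared with the DMA32
numbers from the report at the top of this thread (present 511496;
before THP: low 1780, high 2136; after: low 13972, high 16767). The
struct and function names below are hypothetical; only the arithmetic
is taken from the patches and proposals in the thread.

```c
/* Per-zone values as printed in /proc/zoneinfo, in pages. */
struct zone_marks {
	unsigned long present;
	unsigned long low;
	unsigned long high;
};

/* Old behaviour: stop shrinking only above 8 * high wmark. */
unsigned long mark_8x_high(const struct zone_marks *z)
{
	return 8 * z->high;
}

/* Proposed patch: high wmark plus present/ratio, rounded up. */
unsigned long mark_ratio_gap(const struct zone_marks *z, unsigned long ratio)
{
	return (z->present + ratio - 1) / ratio + z->high;
}

/* Suggested alternative: high wmark plus low wmark. */
unsigned long mark_high_plus_low(const struct zone_marks *z)
{
	return z->high + z->low;
}
```

For the "after" numbers this gives 134136 pages for 8*high, 21882 for
the 1% gap and 30739 for high+low. For the "before" numbers it gives
17088, 7251 and 3916, so in both cases the 1% gap stays well below the
old 8*high threshold.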

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 23:18                       ` Rik van Riel
@ 2011-01-28 10:35                         ` Mel Gorman
  2011-01-28 16:28                           ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-28 10:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 06:18:07PM -0500, Rik van Riel wrote:
> On 01/27/2011 04:31 PM, Mel Gorman wrote:
>
>> Whatever the final solution, it both needs to prevent too much memory
>> being reclaimed and allow kswapd to go to sleep if there is no
>> indication from the page allocator that it should stay awake.
>
> A third requirement:
>
> If one zone has a lot lower memory pressure than another zone,
> we want to do relatively more memory allocations from that zone,
> than from a zone where the memory is heavily used.
>

Risky. Allocations could end up using a lower zone than required, causing
a form of lowmem pressure when highmem should have been used. Worse,
it'll be unnoticeable on x86-64 but potentially cause problems on x86-32
that are easily missed.

> If kswapd only ever goes up to the high watermark, and also uses
> that as its sleep point, the allocations end up corresponding to
> zone size alone and not to memory pressure.
>

hmm.

> Going a little bit above the high watermark (1% of zone size?
> high + min watermark?) will help balance things out between zones.
>
>>>   			if (!zone_watermark_ok_safe(zone, order,
>>> -					8*high_wmark_pages(zone), end_zone, 0))
>>> +					(zone->present_pages +
>>> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>>> +					 KSWAPD_ZONE_BALANCE_GAP_RATIO +
>>> +					high_wmark_pages(zone), end_zone, 0))
>>
>> Rik has already pointed out that this potentially is a very large gap
>> but that is an addressable problem if the final decision goes this
>> direction.
>
> I was wrong.  I guess on some systems the min watermark can be less
> than 1% and (high + min) may be better, but on most systems the
> number of pages should be about the same.
>
> Maybe we should use high_wmark_pages(zone) + low_wmark_pages(zone)
> for easy readability?
>

I'd be ok with high+low as a starting point to solve the immediate
problem of way too much memory being free and then treat "kswapd must go
to sleep" as a separate problem. I'm less keen on 1% but only because it
could be too large a value.

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 10:35                         ` Mel Gorman
@ 2011-01-28 16:28                           ` Andrea Arcangeli
  2011-01-28 16:46                             ` Mel Gorman
  2011-01-28 17:10                             ` Rik van Riel
  0 siblings, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-28 16:28 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 10:35:39AM +0000, Mel Gorman wrote:
> I'd be ok with high+low as a starting point to solve the immediate
> problem of way too much memory being free and then treat "kswapd must go
> to sleep" as a separate problem. I'm less keen on 1% but only because it
> could be too large a value.

min(1%, low) sounds best to me. Because on the 4G system "low" is likely
bigger than 1%.
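
[Editorial illustration, not part of the original message.] A minimal
sketch of the min(1%, low) gap, using a hypothetical helper name
balance_gap() and the same round-up division as the earlier patch. With
the DMA32 numbers reported at the top of the thread (present 511496, low
13972 with THP-sized watermarks), the 1% term is the smaller one.

```c
/*
 * Hypothetical helper: gap added on top of the high wmark, taken as
 * the smaller of 1% of the zone (rounded up) and the low wmark.
 * All values are in pages.
 */
unsigned long balance_gap(unsigned long present_pages,
			  unsigned long low_wmark)
{
	unsigned long one_percent = (present_pages + 99) / 100;

	return one_percent < low_wmark ? one_percent : low_wmark;
}
```

With the pre-THP low wmark of 1780 pages from the same report, the low
wmark would be the limiting term instead.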

But really to me it sounds best to apply my first patch and stick to
the high watermark and remove the gap.

What is going on is dma zone and pci32 zones are at high+gap. over-4g
zone is at "high". kswapd keeps running until all are above high. But
as long as there's at least one not over high, the others are shrunk
up to high+gap.

The allocator is taught that it should try to always allocate from the
over-4g zone. And the over-4g zone is never below the "low" wmark
because 100% of the cache is clean, so kswapd keeps the normal and dma
zones at high+gap and the over-4g zone at "high".

In the previous email you asked me how kswapd gets stuck in D state and
never stops working, and said that it should stop earlier. This sounds
impossible: kswapd behavior can't possibly change, there is simply
less memory freed by lowering that "gap". Also you can make the gap as
big as you want but it'll only make a difference the first time, then
kswapd will stop shrinking the normal and dma zones when they reach
high+gap, regardless of the gap size. So kswapd can't possibly change
behavior and it can't possibly be in D state by just changing this
"gap" size. Which is why I think the gap should be zero and I'd like
my first patch to be applied. There's no point in wasting ram for a
feature that can't guarantee we rotate the zone allocation.

The balancing problem can't be solved in kswapd. It can only be solved
in the allocator if you really aim to give more rotation to the
lrus. As long as the "over4g" zone will be allocated first, at some
point the lrus in the normal/dma zone will have to stop
rotating. Either that or kswapd will shrink 100% of the ram in
dma/normal zone which would destroy all the cache which is clearly
wrong.

And if you change the allocator to allocate in rotation from the 3
zones (clearly we would never want to allocate from the dma zone, so
it's magic area here) there is absolutely no need of any "gap" in
kswapd to keep the shrinking balanced.

In short I think the zone balancing problem tackled in kswapd is wrong
and kswapd should stick to the high wmark only, and if you care about
zone balancing it should be done in the allocator only, then kswapd
will cope with whatever the allocator decides just fine.

I guess the LRU caching behavior of a 4g system with a little memory
over 4g is going to be worse than if you boot with mem=4g and there's
nothing kswapd can do about it as long as the allocator always grabs
the new cache page from the highest zone. Clearly on a 64bit system
allocating below 4g may be ok, but on 32bit system allocating in the
normal zone below 800m must be absolutely avoided. So it's not a simple
problem. Personally I never liked per-zone lru because of this. But
kswapd isn't the solution and it just wastes memory with no benefit
possible except for the first 5sec when the free memory goes up from
170M to 700M and then it remains stuck at 700M while cp runs for
another 2 hours to read all 500G of hd.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 16:28                           ` Andrea Arcangeli
@ 2011-01-28 16:46                             ` Mel Gorman
  2011-01-28 17:16                               ` Rik van Riel
  2011-01-28 17:34                               ` Andrea Arcangeli
  2011-01-28 17:10                             ` Rik van Riel
  1 sibling, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2011-01-28 16:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote:
> On Fri, Jan 28, 2011 at 10:35:39AM +0000, Mel Gorman wrote:
> > I'd be ok with high+low as a starting point to solve the immediate
> > problem of way too much memory being free and then treat "kswapd must go
> > to sleep" as a separate problem. I'm less keen on 1% but only because it
> > could be too large a value.
> 
> min(1%, low) sounds best to me. Because on the 4G system "low" is likely
> bigger than 1%.
> 

On a 4G system, sure. On a 16G system, the gap is larger than
min_free_kbytes. Granted, in that case it's less of a problem because we
don't have a small higher zone causing problems.

> But really to me it sounds best to apply my first patch and stick to
> the high watermark and remove the gap.
> 
> What is going on is dma zone and pci32 zones are at high+gap. over-4g
> zone is at "high". kswapd keeps running until all are above high. But
> as long as there's at least one not over high, the others are shrunk
> up to high+gap.
> 

Yep, this is why there is an excess of free memory and kswapd stuck in D state
as it's stuck in balance_pgdat().

> The allocator is taught that it should try to always allocate from the
> over4g zone. And the over-4g zone is never below the "low" wmark
> because 100% of the cache is clean so kswapd keeps the normal and dma
> zones at high+gap and the over-4g zone at "high".
> 

A consequence of this is that it's much harder for pages in a small high zone
to get old while kswapd stays awake. They get reclaimed far sooner than pages
in the Normal zone, which no doubt leads to some unexpected slowdowns. It's
another reason why we should be making sure kswapd gets to sleep when
there is no pressure.

> In previous email you asked me how kswapd get stuck in D state and
> never stops working, and that it should stop earlier. This sounds
> impossible, kswapd behavior can't possibly change, simply there is
> less memory freed by lowering that "gap".

There might be less memory freed by lowering that gap but it still needs to
exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up
to the high watermark + gap and calling congestion_wait (hence the D state).

> Also you can make the gap as
> big as you want but it'll only make a difference the first time, then
> kswapd will stop shrinking normal and dma zone when they reach
> high+gap. Regardless of the gap size. So kswapd can't possibly change
> behavior and it can't possibly be in D state by just changing this
> "gap" size. Which is why I think the gap should be zero and I'd like
> my first patch to be applied. There's no point to waste ram for a
> feature that can't guarantee we rotate the zone allocation.
> 

Ok, the gap idea will certainly work in that there will be less memory
freed. It's the first obvious problem and it's the best solution so far.
I will double check myself later if kswapd is stuck in D state due to looping
around balance_pgdat().

> The balancing problem can't be solved in kswapd. It can only be solved
> in the allocator if you really aim to give more rotation to the
> lrus. As long as the "over4g" zone will be allocated first, at some
> point the lrus in the normal/dma zone will have to stop
> rotating. Either that or kswapd will shrink 100% of the ram in
> dma/normal zone which would destroy all the cache which is clearly
> wrong.
> 
> And if you change the allocator to allocate in rotation from the 3
> zones (clearly we would never want to allocate from the dma zone, so
> it's magic area here) there is absolutely no need of any "gap" in
> kswapd to keep the shrinking balanced.
> 

Rotating through the zones is no problem to implement. The expected problem
is that allocations that could use HighMem or Normal instead use DMA32
potentially causing a request that requires DMA32 to fail later.

> In short I think the zone balancing problem tackled in kswapd is wrong
> and kswapd should stick to the high wmark only, and if you care about
> zone balancing it should be done in the allocator only, then kswapd
> will cope with whatever the allocator decides just fine.
> 

Potentially. We'd need to be careful that allocation requests are not getting
stalled but it's worth investigating.

> I guess the LRU caching behavior of a 4g system with a little memory
> over 4g is going to be worse than if you boot with mem=4g and there's
> nothing kswapd can do about it as long as the allocator always grabs
> the new cache page from the highest zone.

Agreed.

> Clearly on a 64bit system
> allocating below 4g may be ok, but on 32bit system allocating in the
> normal zone below 800m must be absolutely avoided. So it's not simple
> problem.

Exactly.

> Personally I never liked per-zone lru because of this. But
> kswapd isn't the solution and it just wastes memory with no benefit
> possible except for the first 5sec when the free memory goes up from
> 170M to 700M and then it remains stuck at 700M while cp runs for
> another 2 hours to read all 500G of hd.
> 

:/

-- 
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 16:46                             ` Mel Gorman
@ 2011-01-28 17:16                               ` Rik van Riel
  2011-01-28 17:46                                 ` Andrea Arcangeli
  2011-01-28 17:34                               ` Andrea Arcangeli
  1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-28 17:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/28/2011 11:46 AM, Mel Gorman wrote:
> On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote:

>> In previous email you asked me how kswapd get stuck in D state and
>> never stops working, and that it should stop earlier. This sounds
>> impossible, kswapd behavior can't possibly change, simply there is
>> less memory freed by lowering that "gap".
>
> There might be less memory freed by lowering that gap but it still needs to
> exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up
> to the high watermark + gap and calling congestion_wait (hence the D state).

The gap works because kswapd has different thresholds for
different things:

1) get woken up if every zone on an allocator's zone list
    is below the low watermark

2) exit the loop if _every_ zone is at or above the
    high watermark

3) skip a zone in the freeing loop if the zone has more
    than high + gap free memory

Continuing the loop as long as one zone is below the low
watermark is what equalizes memory pressure between zones.

Skipping the freeing of pages in a zone that already has
excessive amounts of free memory helps avoid memory waste
and excessive swapping.  We simply equalize the balance
between zones a little more slowly.  What matters is that
the memory pressure gets equalized over time.
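
As a rough illustration of the three thresholds above, here is a minimal
Python sketch. It is not the kernel's balance_pgdat() code: the Zone class,
the function names and the page counts are hypothetical and only serve to
make the three conditions concrete.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free: int  # free pages
    low: int   # low watermark, in pages
    high: int  # high watermark, in pages


def should_wake_kswapd(zonelist):
    # (1) wake up only if every zone on the allocator's zonelist is below low
    return all(z.free < z.low for z in zonelist)


def is_balanced(zones):
    # (2) exit the loop once every zone is at or above its high watermark
    return all(z.free >= z.high for z in zones)


def zones_to_shrink(zones, gap):
    # (3) skip zones that already have more than high + gap pages free
    return [z.name for z in zones if z.free <= z.high + gap]


# Hypothetical numbers: DMA32 has plenty free, Normal is still short.
zones = [Zone("DMA32", 2500, 1780, 2136), Zone("Normal", 100, 150, 200)]
print(should_wake_kswapd(zones))    # False: DMA32 is not below low
print(is_balanced(zones))           # False: Normal is below high
print(zones_to_shrink(zones, 300))  # ['Normal']: DMA32 is above high + gap
```

In this model the gap only affects condition (3): a zone that is already
well above its high watermark is left alone while the loop continues for
the zones that are still short.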

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 17:16                               ` Rik van Riel
@ 2011-01-28 17:46                                 ` Andrea Arcangeli
  2011-01-28 18:03                                   ` Rik van Riel
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-28 17:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 12:16:19PM -0500, Rik van Riel wrote:
> On 01/28/2011 11:46 AM, Mel Gorman wrote:
> > On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote:
> 
> >> In previous email you asked me how kswapd get stuck in D state and
> >> never stops working, and that it should stop earlier. This sounds
> >> impossible, kswapd behavior can't possibly change, simply there is
> >> less memory freed by lowering that "gap".
> >
> > There might be less memory freed by lowering that gap but it still needs to
> > exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up
> > to the high watermark + gap and calling congestion_wait (hence the D state).
> 
> The gap works because kswapd has different thresholds for
> different things:
> 
> 1) get woken up if every zone on an allocator's zone list
>     is below the low watermark
> 
> 2) exit the loop if _every_ zone is at or above the
>     high watermark
> 
> 3) skip a zone in the freeing loop if the zone has more
>     than high + gap free memory

Exactly.

> 
> Continuing the loop as long as one zone is below the low
> watermark is what equalizes memory pressure between zones.

I think you meant below high wmark here.

> Skipping the freeing of pages in a zone that already has
> excessive amounts of free memory helps avoid memory waste
> and excessive swapping.  We simply equalize the balance
> between zones a little more slowly.  What matters is that
> the memory pressure gets equalized over time.

The main problem I could see is for the lowmem reserve ratio. The only
real wmark that will be relevant to the allocator will be the one of
the "exact" zone asked to the allocator, not the below zones because
of the reserve ratio. So then kswapd will only satisfy the high wmark
from the view of the caller for the "exact" zone asked (not the below
zones that also must take the lowmem reserve ratio into
account). Which is enough but kswapd isn't helping the allocator for
the below zones. In any case the gap won't ever be as big as the
reserve ratio of the lower zones, so it can't solve this regardless
with the gap. Probably what we have right now is already optimal so to
put more shrinking pressure on the highest zone asked.

Overall I don't see the point of the gap as it's just like setting the
below zone wmark higher and I doubt it makes a significant balancing
difference. But hey I'm also ok to keep the gap above zero, I just
feel it's wasted memory. Surely it should be easy to prove it's wasted
memory for the "cp /dev/sda /dev/null" workload on a 4g system with a
little ram above 4g. For mixed workloads things are a little more
interesting but I think on average it's not worth it.

My whole point in claiming it can't affect the balancing of the lrus,
is that the real lru rotation is entirely controlled by the
allocator. It doesn't matter if kswapd stops at high or high+gap, for
any zone at any time, as long as the allocator only allocates from one
zone or the other. And if the allocator allocates from all zones in a
perfectly balanced way, again kswapd will shrink in a perfectly
balanced way over time regardless of high or high+gap.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 17:46                                 ` Andrea Arcangeli
@ 2011-01-28 18:03                                   ` Rik van Riel
  2011-01-28 18:24                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-28 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/28/2011 12:46 PM, Andrea Arcangeli wrote:

> My whole point in claiming it can't affect the balancing of the lrus,
> is that the real lru rotation is entirely controlled by the
> allocator. It doesn't matter if kswapd stops at high or high+gap, for
> any zone at any time, as long as the allocator only allocates from one
> zone or the other. And if the allocator allocates from all zones in a
> perfectly balanced way, again kswapd will shrink in a perfectly
> balanced way over time regardless of high or high+gap.

My point is, the behaviour you describe would be WRONG :)

The reason is that the different zones can contain data
that is either heavily used or rarely used, often some
mixture of the two, but sometimes the zones are out of
balance in how much the data in memory gets touched.

We need to reclaim and reuse the lightly used memory
a little faster than the heavily used memory, to even
out the memory pressure between zones.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 18:03                                   ` Rik van Riel
@ 2011-01-28 18:24                                     ` Andrea Arcangeli
  2011-01-28 19:34                                       ` Rik van Riel
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-28 18:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 01:03:50PM -0500, Rik van Riel wrote:
> My point is, the behaviour you describe would be WRONG :)
> 
> The reason is that the different zones can contain data
> that is either heavily used or rarely used, often some
> mixture of the two, but sometimes the zones are out of
> balance in how much the data in memory gets touched.
> 
> We need to reclaim and reuse the lightly used memory
> a little faster than the heavily used memory, to even
> out the memory pressure between zones.

I've no idea how kswapd can reclaim the lightly used memory a little
faster when it blocks at high+gap. Unless the allocator is eating into
the gap, kswapd will be stuck at 700M free, and no rotation in the lru
will ever happen in the lower zones. You can't control it from kswapd
but only from the allocator, and regardless of the size of the gap the
rotation won't change. Eventually, in the "cp /dev/sda /dev/null"
example workload (which simulates what happens normally during any file
read), "high+gap" will be reached in 5 sec and then it'll be as if
there's no gap for the next 2 hours.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 18:24                                     ` Andrea Arcangeli
@ 2011-01-28 19:34                                       ` Rik van Riel
  2011-01-28 19:45                                         ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-28 19:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/28/2011 01:24 PM, Andrea Arcangeli wrote:
> On Fri, Jan 28, 2011 at 01:03:50PM -0500, Rik van Riel wrote:
>> My point is, the behaviour you describe would be WRONG :)
>>
>> The reason is that the different zones can contain data
>> that is either heavily used or rarely used, often some
>> mixture of the two, but sometimes the zones are out of
>> balance in how much the data in memory gets touched.
>>
>> We need to reclaim and reuse the lightly used memory
>> a little faster than the heavily used memory, to even
>> out the memory pressure between zones.
>
> I've no idea how kswapd can reclaim the lightly used memory a little
> faster when it blocks at high+gap.

It will block at high+gap only when one zone has really
easily reclaimable memory, and another zone has difficult
to free memory.

That creates a free memory differential between the
easy to free and difficult to free memory zones.

If memory in all zones is equally easy to free, kswapd
will go to sleep once the high watermark is reached in
every zone.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 19:34                                       ` Rik van Riel
@ 2011-01-28 19:45                                         ` Andrea Arcangeli
  2011-01-28 20:55                                           ` Rik van Riel
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-28 19:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 02:34:31PM -0500, Rik van Riel wrote:
> It will block at high+gap only when one zone has really
> easily reclaimable memory, and another zone has difficult
> to free memory.

The other zone doesn't need to be difficult to free up. All ram in
immediately freeable clean cache is the most common case there is. And
it's more than enough to trigger the scenario in prev email.

> That creates a free memory differential between the
> easy to free and difficult to free memory zones.

There's no difficult to free zone in this scenario.

> If memory in all zones is equally easy to free, kswapd
> will go to sleep once the high watermark is reached in
> every zone.

Yes, at that point the high wmark is reached for all zones. Then cp or
any file read allocates another high-low amount of clean cache, and
kswapd will be woken again. Then when it goes to sleep the over4g tiny
zone will be at "high" again but the below zones will be at
high+(high_over4gwmark-low_over4gwmark), in about 5 seconds the over4g
zone will be at "high" and the other two zones will be at
"high+gap". All when there's zero memory pressure in the below zones,
and there's just some clean cache shrinking required to allocate the
new cache from the over4g zone. Then the below zones lru stops
rotating regardless of the size of the gap (0 or 600M makes no
difference).
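
A toy model of this scenario, as a hedged sketch only: it assumes all cache
is clean (reclaim always succeeds), that the allocator takes pages only from
the over4g zone, and that kswapd frees the same batch from every zone it
does not skip. The function name, the batch size and all page counts are
hypothetical.

```python
def simulate(gap, wakeups, batch=25):
    """Return (final free pages in the lower zone, pages reclaimed from
    the lower zone on each kswapd wakeup)."""
    lower_free, lower_high = 2136, 2136   # lower zone starts at "high"
    over4g_low, over4g_high = 400, 500    # tiny zone above 4g
    reclaimed_per_wakeup = []
    for _ in range(wakeups):
        # The allocator drains the over4g zone down to "low",
        # which wakes kswapd.
        over4g_free = over4g_low
        before = lower_free
        # kswapd loops until the over4g zone is back at "high".
        while over4g_free < over4g_high:
            over4g_free += batch
            # The lower zone is shrunk too unless it exceeds high + gap.
            if lower_free <= lower_high + gap:
                lower_free += batch
        reclaimed_per_wakeup.append(lower_free - before)
    return lower_free, reclaimed_per_wakeup


print(simulate(gap=300, wakeups=6))  # (2461, [100, 100, 100, 25, 0, 0])
print(simulate(gap=0, wakeups=6))    # (2161, [25, 0, 0, 0, 0, 0])
```

Under these assumptions the gap changes how much memory ends up free in the
lower zone, but in both runs reclaim from the lower zone drops to zero after
the first few wakeups.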

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 19:45                                         ` Andrea Arcangeli
@ 2011-01-28 20:55                                           ` Rik van Riel
  2011-01-29 19:45                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-28 20:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/28/2011 02:45 PM, Andrea Arcangeli wrote:
> On Fri, Jan 28, 2011 at 02:34:31PM -0500, Rik van Riel wrote:
>> It will block at high+gap only when one zone has really
>> easily reclaimable memory, and another zone has difficult
>> to free memory.
>
> The other zone doesn't need to be difficult to free up. All ram in
> immediately freeable clean cache is the most common case there is. And
> it's more than enough to trigger the scenario in prev email.
>
>> That creates a free memory differential between the
>> easy to free and difficult to free memory zones.
>
> There's no difficult to free zone in this scenario.

In that case, every zone will go down to the low watermark
before kswapd is woken up.

At that point, kswapd will reclaim until every zone is at
the high watermark, and go back to sleep.

There is no "free up to high + gap" in your scenario.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 20:55                                           ` Rik van Riel
@ 2011-01-29 19:45                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-29 19:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 03:55:09PM -0500, Rik van Riel wrote:
> In that case, every zone will go down to the low watermark
> before kswapd is woken up.

This isn't what happens though. If that were what happened, we
would see free memory going back down to ~130M, then up to 700M, and
then down again to 130M, not stuck at 700M at all times like
below. Example:

 0  0  70512 134940 379408 2753936    0    0   118    71    5    3  2  1 97  1
 0  0  70512 134808 379408 2753936    0    0     0     0   54   48  0  0 100  0
 0  1  70512 131228 383448 2753928    0    0  4160    68  149  172  0  0 99  1
 0  1  70512 276548 502184 2495564    0    0 118784    36 1357 2084  0  5 73 21
 1  1  70512 507932 624128 2151616    0    0 121984     0 1521 2166  0  6 77 17
 0  1  70512 699264 746484 1860468    0    0 122368     4 1443 2242  0  5 74 20
 0  1  70512 727040 865936 1722716    0    0 119552     0 1344 2194  0  5 75 21
 0  1  70512 733116 984396 1610292    0    0 118528     0 1311 2139  0  4 76 20
 1  0  70512 724064 1102864 1510256    0    0 118528     0 1302 2132  0  4 75 21
 1  0  70512 728900 1224312 1394328    0    0 121472     0 1395 2168  0  4 77 19
 1  0  70512 733736 1337224 1286852    0    0 115840    40 1404 2074  0  4 74 22

> At that point, kswapd will reclaim until every zone is at
> the high watermark, and go back to sleep.
> 
> There is no "free up to high + gap" in your scenario.

Well there clearly is from vmstat... I think you should be able to
reproduce if you boot with something like mem=4200m or so, workload is
simple "cp /dev/sda /dev/null".

Maybe we're waking kswapd too soon. But kswapd definitely goes to
sleep, in fact it sleeps most of the time and runs only every once in a
while, and it's unclear why the free memory never gets back to the 130M
level where it usually sits when there's no intensive read I/O like
shown above. For now, given what I see, I have to assume kswapd is
woken too soon, and not only when all wmarks reach low, or the free
memory wouldn't be stuck at ~700M at all times while cp runs.

If kswapd is woken up too soon, to me that is a separate problem and I
still don't see a significant benefit in having any "gap" bigger than
"high-low" there...

Like you said, kswapd shouldn't run until we hit the low wmark again on
all zones, and I think that's more than enough without more "gap" than
the already available default "high-low" gap for the lower zones. If
the zone is bigger (like the below4g zone above) the wmark will be
bigger relative to the other zones. So when kswapd is woken up because
all zones reach the low wmark (we agree this is what should happen even if
it doesn't look like it's working right with "cp"), assuming all cache
is clean and immediately freeable, kswapd will have to invoke
shrink_cache more times for the below4g zone. This "gap" added to
"high-low" will make the above4g lru rotate more times than needed to
reach the high wmark. But we allocated only a "high-low" amount of cache
in the above4g zone lru. So I'm not sure if shrinking more than
"high-low" from it is right even from a balancing perspective in the
absolutely trivial case of just 1 wakeup every time all zones hit the
low wmark.

At the same time, if kswapd frees memory at the same rate that an
over4g allocator is allocating it, kswapd won't go to sleep and there
will be no rotation in the below4g lru at all. This is similar to what
we see above in fact, except for me kswapd goes to sleep because cp
isn't fast enough, but a page fault could trigger it and prevent the
lru of the lower zones from ever rotating (simulating a kswapd wakeup too
soon, by just not making kswapd go to sleep and keeping hitting on the
high-low range on the over4g zone). So you see, there is no real
reliable way to have balancing guarantees from kswapd, and for the
trivial case where there is no concurrency between allocator and
kswapd freeing, rotating the tiny above4g lru more than "high-low"
even though we only allocated "high-low" cache into it doesn't sound
obviously right either. A bigger gap to me looks like it will do more harm
than good, and if we need a real guarantee of balancing we should
rotate the allocations across the zones (a bigger lru in a zone will
require it to be hit more frequently because it'll rotate slower than
the other zones; the bias should not even depend on the zone size
but on the lru size).

So for now it's all statistical, but I doubt the "gap" shrunk in
addition to the "high-low" cache max allocated is providing any benefit.

Even in the non racing case all I can see is the smaller zones
(satisfying the "high" wmark faster than the bigger zones) (and the
smaller zones statistically should get a smaller lru too) being
lru-rotated way more than their small "high-low". Smaller zones should
be rotated in proportion to their small "high-low" only, and not
potentially as much as the biggest "high-low" for the biggest zone.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 16:46                             ` Mel Gorman
  2011-01-28 17:16                               ` Rik van Riel
@ 2011-01-28 17:34                               ` Andrea Arcangeli
  1 sibling, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-28 17:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rik van Riel, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 04:46:24PM +0000, Mel Gorman wrote:
> On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote:
> > On Fri, Jan 28, 2011 at 10:35:39AM +0000, Mel Gorman wrote:
> > > I'd be ok with high+low as a starting point to solve the immediate
> > > problem of way too much memory being free and then treat "kswapd must go
> > > to sleep" as a separate problem. I'm less keen on 1% but only because it
> > > could be too large a value.
> > 
> > min(1%, low) sounds best to me. Because on the 4G system "low" is likely
> > bigger than 1%.
> > 
> 
> On a 4G system, sure. On a 16G system, the gap is larger than
> min_free_kbytes. Granted, in that case it's less of a problem because we
> don't have a small higher zone causing problems.

Agreed, there I also prefer the low wmark ;).
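
To make the min(1%, low) proposal concrete, here is a small Python sketch.
It assumes "1%" means 1% of the zone's present pages, which the thread does
not spell out, and the function name is hypothetical. The numbers are the
"After" DMA32 figures from the first message in this thread.

```python
def kswapd_gap(present_pages, low_wmark):
    # Proposed gap: the smaller of 1% of the zone and its low watermark.
    # Assumption: "1%" is taken of the zone's present pages.
    return min(present_pages // 100, low_wmark)


# DMA32 "After" values: present 511496 pages, low watermark 13972 pages.
print(kswapd_gap(511496, 13972))  # 5114 pages, about 20M with 4K pages
```

With these numbers "low" is bigger than 1% of the zone, so the 1% term is
the one that limits the gap.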

> Yep, this is why there is an excess of free memory and kswapd stuck in D state
> as it's stuck in balance_pgdat().

kswapd in the "cp /dev/sda /dev/null" workload can't possibly be stuck
in D state at any given time. There's no I/O it has to do, it's 100%
clean cache. It's always in S or R state. But every time it gets woken
up when the over4g zone hits the low wmark, it shrinks the over4g zone
until it's over "high" and also until all below zones are at
"high+gap". So in 5 sec what happens is the other zones are stuck at
"high+gap" and it stops shrinking them forever, and it only keeps the
over-4g zone between "low" and "high", because the allocator always picks
from the over4g zone.

> A consequence of this is that it's much harder for pages in a small high zone
> to get old while kswapd stays awake. They get reclaimed far sooner than pages
> in the Normal zone, which no doubt leads to some unexpected slowdowns. It's
> another reason why we should be making sure kswapd gets to sleep when
> there is no pressure.

The problem is not kswapd, it's the allocator. There's nothing
kswapd can do about it. kswapd has no trouble shrinking any zone,
it's all 100% clean immediately reclaimable cache, we could shrink it
even from GFP_ATOMIC context from irq (just not nmi) if we wanted.

> There might be less memory freed by lowering that gap but it still needs to
> exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up
> to the high watermark + gap and calling congestion_wait (hence the D state).

I just can't see how the size of the "gap" can make any difference: 0
gap or 1g gap, the only thing that will change is the amount of free
memory you see, not the kswapd state.

> Ok, the gap idea will certainly work in that there will be less memory
> freed. It's the first obvious problem and it's the best solution so far.
> I will double check myself later if kswapd is stuck in D state due to looping
> around balance_pgdat().

I'll check that too, but I don't see how the gap can affect that.

Setting the gap to 600M with high set to 100M is like setting high to
700M manually for that zone and eliminating the gap. The only thing that
changes is the behavior of min_free_kbytes.

> Rotating through the zones is no problem to implement. The expected problem
> is that allocations that could use HighMem or Normal instead use DMA32
> potentially causing a request that requires DMA32 to fail later.

Exactly. Note the lowmem reserve ratio algorithm exists exactly to
reserve a portion of memory to the users of the lowmem
zones. Otherwise things go bad when all memory is free. So thanks to
the lowmem reserve ratio algorithm, it's less of an issue to rotate
across the zones. But it's a separate issue.

> > I guess the LRU caching behavior of a 4g system with a little memory
> > over 4g is going to be worse than if you boot with mem=4g and there's
> > nothing kswapd can do about it as long as the allocator always grabs
> > the new cache page from the highest zone.
> 
> Agreed.
> 
> > Clearly on a 64bit system
> > allocating below 4g may be ok, but on 32bit system allocating in the
> > normal zone below 800m must be absolutely avoided. So it's not simple
> > problem.
> 
> Exactly.

Full agreement here.

As said above it is very possible the lowmem reserve ratio is enough
and we can now rotate freely across the zones. The lowmem reserve
ratio is already tuned in a way that on a 32G x86_32 the whole normal
zone will be forbidden. It scales down as the ratio between the
highmem vs normal zone goes down. On a 1g system most of the normal
zone becomes available also for highmem allocations. It's made exactly
for that.
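
A back-of-the-envelope Python sketch of that scaling. The assumptions are
loud ones: a reserve ratio of 32 between the normal and highmem zones, an
896M lowmem/highmem split on x86_32, 4K pages, and the DMA zone ignored.
The helper name is hypothetical; this only reproduces the arithmetic, not
the kernel's own setup code.

```python
PAGE_SIZE = 4096
MIB = 2**20
LOWMEM_BYTES = 896 * MIB                  # assumed x86_32 lowmem split
NORMAL_PAGES = LOWMEM_BYTES // PAGE_SIZE  # 229376 pages, DMA zone ignored


def normal_reserve_pages(total_ram_bytes, ratio=32):
    # Pages of the normal zone kept back from allocations that could
    # have been satisfied from highmem: highmem size divided by the ratio.
    highmem_pages = max(total_ram_bytes - LOWMEM_BYTES, 0) // PAGE_SIZE
    return highmem_pages // ratio


# 32G box: the reserve exceeds the whole normal zone.
print(normal_reserve_pages(32 * 1024 * MIB), NORMAL_PAGES)  # 254976 229376
# 1G box: only 1024 pages (4M) of the normal zone are held back.
print(normal_reserve_pages(1024 * MIB))                     # 1024
```

Under these assumptions the 32G case reserves more pages than the normal
zone holds, while the 1G case leaves almost all of it usable for highmem
allocations, which matches the behaviour described above.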

If we want to tackle this later we can and we can try to depend
entirely on the lowmem reserve ratio to do the right thing at
allocation time by making all wmark variable depending on who's
allocating what, but kswapd should just stick to "high" IMHO and gap
0.

However if I'm proven wrong then I'm also ok with min(1%, low), no
problem with me. Once we fix this (either with gap 0 or gap
min(1%,low)), running --set-recommended-min_free_kbytes should lead to
less memory wasted (in the 4g setup with a little memory over 4g) than
before running --set-recommended-min_free_kbytes at boot.

> > Personally I never liked per-zone lru because of this. But
> > kswapd isn't the solution and it just wastes memory with no benefit
> > possible except for the first 5sec when the free memory goes up from
> > 170M to 700M and then it remains stuck at 700M while cp runs for
> > another 2 hours to read all 500G of hd.
> > 
> 
> :/

;)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-28 16:28                           ` Andrea Arcangeli
  2011-01-28 16:46                             ` Mel Gorman
@ 2011-01-28 17:10                             ` Rik van Riel
  1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-28 17:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/28/2011 11:28 AM, Andrea Arcangeli wrote:

> In short I think the zone balancing problem tackled in kswapd is wrong
> and kswapd should stick to the high wmark only, and if you care about
> zone balancing it should be done in the allocator only, then kswapd
> will cope with whatever the allocator decides just fine.

The allocator does not have information on which memory
zones have more heavily used data vs which zones have
less frequently used data.

When the system starts up, we do our initial allocations
in the top zone.  This includes both heavily used files
(like libc) and never-used-again files, as well as daemons
that are active and daemons that go to sleep and never do
anything again.

After initial startup, we may eventually end up falling
back to lower memory zones.

In short, we may have an imbalance between the zones in
how actively memory is used, from the moment the system
has started up.

The distance between the low and high watermarks
corresponds only to the relative size of each zone.

Having kswapd move only between these two watermarks
means that memory in each zone is allocated and freed
only according to zone size, not according to how
actively used the memory in each zone is.

Giving kswapd a little bit of extra room where it
is allowed to free extra pages in a zone with lots of
infrequently used and easily reclaimable pages, when
another zone in the same node suffers from harder to
deal with memory pressure, will steer more allocations
towards the memory zone that has less pressure.

This should even out the pressure between zones over
time.

We have had the kernel work like this since 2.6.0, and
I believe that removing this "pressure valve" from the
VM will result in the kind of balancing problems we had
in some 2.4 kernels.

Reducing the size of the gap is fine with me, since
the pressure should even out over time.  Removing the
gap is just asking for trouble.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 15:27               ` Andrea Arcangeli
  2011-01-27 16:03                 ` Mel Gorman
@ 2011-02-03  2:58                 ` Andrea Arcangeli
  2011-02-03 13:15                   ` Mel Gorman
                                     ` (2 more replies)
  2011-02-12  9:48                 ` alex shi
  2 siblings, 3 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-03  2:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel

On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> totally untested... I will test....

The below patch is fixing my problem and working fine for me... as
expected it can't possibly lead to any D state, it's pretty much like
setting min_free_kbytes lower, and it's not going to alter anything
other than the levels of free memory kept by kswapd.

$ while :; do ps xa|grep [k]swapd; sleep 1; done
  452 ?        R      1:20 [kswapd0]
  452 ?        S      1:20 [kswapd0]
  452 ?        S      1:20 [kswapd0]
  452 ?        S      1:20 [kswapd0]
  452 ?        S      1:20 [kswapd0]
  452 ?        R      1:20 [kswapd0]
  452 ?        R      1:20 [kswapd0]
  452 ?        R      1:20 [kswapd0]
  452 ?        R      1:20 [kswapd0]
  452 ?        S      1:20 [kswapd0]
  452 ?        R      1:20 [kswapd0]
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  1   1784 111040 2393336 807924    0    0    63   992   56   70  1   1 96  2
 0  1   1784 108928 2402556 801864    0    0 122624     0 1619 2150  0   5 80 16
 0  1   1784 110664 2401244 801140    0    0 122496     0 1602 2081  0   3 81 16
 0  1   1784 109796 2410184 792984    0    0 122752     0 1685 2149  0   4 80 16
 0  1   1784 110416 2411856 791208    0    0 120448     4 1599 2075  0   4 81 16
 1  0   1784 113516 2415344 785336    0    0 122496     0 1636 2125  0   4 81 15

I doubt we'll get any regression because of the below (see also my
prev email in this thread), and I would only expect more cache and
maybe better lru. Previously the free memory levels were stuck at
~700M; now they're stuck at the right level for a 4G system with THP on
(I'd still like to try to reduce the requirements to only 1 hugepage for
each migratetype in set_min_free_kbytes to reduce the requirements
to the minimum, but only if possible..). But this saves 600M over 4G so
it's the highest prio to address.
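
To make the arithmetic concrete, here is a minimal user-space sketch (not kernel code) of the level at which kswapd stops shrinking a zone, with and without the 8x multiplier. It is a simplification: the real check also takes the allocation order and the lowmem reserve into account. The function names are hypothetical and the values are in MB, to match the numbers quoted in this thread.

```c
#include <assert.h>

/*
 * Hypothetical model of the check kswapd applies before shrinking a
 * zone: keep reclaiming while free memory is below mult * high.
 */
static int kswapd_keeps_shrinking(unsigned long free_mb,
				  unsigned long high_wmark_mb,
				  unsigned long mult)
{
	assert(mult >= 1);
	return free_mb < mult * high_wmark_mb;
}

/* Free memory level at which the model stops shrinking the zone. */
static unsigned long kswapd_stop_level_mb(unsigned long high_wmark_mb,
					  unsigned long mult)
{
	assert(mult >= 1);
	return mult * high_wmark_mb;
}
```

With a high wmark of 88M the model stops at 704M with the multiplier and at 88M without it, a difference of 616M, which is where the roughly 600M saving comes from.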

Comments welcome,
Thanks!
Andrea

> ====
> Subject: vmscan: kswapd must not free more than high_wmark pages
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> When the min_free_kbytes is set with "hugeadm
> --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> anti-frag at full effectiveness automatically at boot) the high wmark
> of some zone is as high as ~88M. 88M free on a 4G system isn't
> horrible, but 88M*8 = 704M free on a 4G system is definitely
> unbearable. This only tends to be visible on 4G systems with a tiny
> over-4g zone, where kswapd insists on reaching the high wmark on the
> over-4g zone but in doing so shrinks up to 704M from the normal zone by
> mistake.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5d90de..9e3c78e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2407,7 +2407,7 @@ loop_again:
>  			 * zone has way too many pages free already.
>  			 */
>  			if (!zone_watermark_ok_safe(zone, order,
> -					8*high_wmark_pages(zone), end_zone, 0))
> +					high_wmark_pages(zone), end_zone, 0))
>  				shrink_zone(priority, zone, &sc);
>  			reclaim_state->reclaimed_slab = 0;
>  			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-03  2:58                 ` Andrea Arcangeli
@ 2011-02-03 13:15                   ` Mel Gorman
  2011-02-03 18:59                     ` Andrea Arcangeli
  2011-02-03 14:36                   ` Rik van Riel
  2011-02-14  2:25                   ` Shaohua Li
  2 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-03 13:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel

On Thu, Feb 03, 2011 at 03:58:08AM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> > totally untested... I will test....
> 
> The below patch is fixing my problem and working fine for me... as
> expected it can't possibly lead to any D state, it's pretty much like
> setting min_free_kbytes lower, and it's not going to alter anything
> other than the levels of free memory kept by kswapd.
> 
> $ while :; do ps xa|grep [k]swapd; sleep 1; done
>   452 ?        R      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]

I got a chance to test this today and I see similar results. I still do see
kswapd entering D state occasionally and I'm convinced it's because it's
calling congestion_wait() i.e. it's not real IO but it's being accounted
for as an IO-related wait. That said, it's mostly asleep (S) or running (R)
and free memory is at reasonable levels so it's a big improvement.

> $ vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  1   1784 111040 2393336 807924    0    0    63   992   56   70  1   1 96  2
>  0  1   1784 108928 2402556 801864    0    0 122624     0 1619 2150  0   5 80 16
>  0  1   1784 110664 2401244 801140    0    0 122496     0 1602 2081  0   3 81 16
>  0  1   1784 109796 2410184 792984    0    0 122752     0 1685 2149  0   4 80 16
>  0  1   1784 110416 2411856 791208    0    0 120448     4 1599 2075  0   4 81 16
>  1  0   1784 113516 2415344 785336    0    0 122496     0 1636 2125  0   4 81 15
> 
> I doubt we'll get any regression because of the below (see also my
> prev email in this thread), and I would only expect more cache and
> maybe better lru. Previously the free memory levels were stuck at
> ~700M now they're stuck at the right level for a 4G system with THP on
> (I'd still like to try to reduce the requirements only 1 hugepage for
> each migratetype in the set_min_free_kbytes to reduce the requirements
> to the minium, but only if possible..). But this saves 600M over 4G so
> it's the highest prio to address.
> 
> Comments welcome,

I think this is the best direction to take for the moment to close the obvious
bug. More thought is required on when exactly kswapd is going to sleep and
on what zones the allocator should be using but there is no quick answer that
will not simply have other consequences. As much as I'd like to investigate this
further now, I'm in the process of changing jobs and expect to be heavily
disrupted for at least a month during the changeover. So, for this;

Reviewed-and-tested-by: Mel Gorman <mel@csn.ul.ie>

> Thanks!
> Andrea
> 
> > ====
> > Subject: vmscan: kswapd must not free more than high_wmark pages
> > 
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > When the min_free_kbytes is set with `hugeadm
> > --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> > equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> > anti-frag at full effectiveness automatically at boot) the high wmark
> > of some zone is as high as ~88M. 88M free on a 4G system isn't
> > horrible, but 88M*8 = 704M free on a 4G system is definitely
> > unbearable. This only tends to be visible on 4G systems with tiny
> > over-4g zone where kswapd insists to reach the high wmark on the
> > over-4g zone but doing so it shrunk up to 704M from the normal zone by
> > mistake.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> > 
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f5d90de..9e3c78e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2407,7 +2407,7 @@ loop_again:
> >  			 * zone has way too many pages free already.
> >  			 */
> >  			if (!zone_watermark_ok_safe(zone, order,
> > -					8*high_wmark_pages(zone), end_zone, 0))
> > +					high_wmark_pages(zone), end_zone, 0))
> >  				shrink_zone(priority, zone, &sc);
> >  			reclaim_state->reclaimed_slab = 0;
> >  			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
> > 
> > 
> 

-- 
Mel Gorman


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-03 13:15                   ` Mel Gorman
@ 2011-02-03 18:59                     ` Andrea Arcangeli
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-03 18:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel

On Thu, Feb 03, 2011 at 01:15:49PM +0000, Mel Gorman wrote:
> I got a chance to test this today and I see similar results. I still do see
> kswapd entering D state occasionally and I'm convinced it's because it's
> calling congestion_wait() i.e. it's not real IO but it's being accounted
> for as an IO-related wait. That said, it's mostly asleep (S) or running (R)
> and free memory is at reasonable levels so it's a big improvement.

I have never seen it in D state here, but maybe it happens
occasionally. I would expect the R/S/D states not to be altered by
this change; only the free levels should be altered.

> I think this is the best direction to take for the moment to close the obvious
> bug. More thought is required on when exactly kswapd is going to sleep and
> on what zones the allocator should be using but there is no quick answer that
> will simply have other consequences. As much as I'd like to investigate this
> further now, I'm in the process of changing jobs and expect to be heavily
> disrupted for at least a month during the changeover. So, for this;

I fully agree we should check (with less hurry) exactly when kswapd is
going to sleep in this load in case it's woken too early. I expect it
will remain an independent issue and I don't expect this patch to have
to be reverted once we figure out why free levels always stay at "high"
and we don't see them reaching "low".

Thanks for the review,
Andrea


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-03  2:58                 ` Andrea Arcangeli
  2011-02-03 13:15                   ` Mel Gorman
@ 2011-02-03 14:36                   ` Rik van Riel
  2011-02-03 19:11                     ` Andrea Arcangeli
  2011-02-14  2:25                   ` Shaohua Li
  2 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-02-03 14:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 02/02/2011 09:58 PM, Andrea Arcangeli wrote:

> Comments welcome,
> Thanks!
> Andrea
>
>> ====
>> Subject: vmscan: kswapd must not free more than high_wmark pages

NAK

I believe we need a little bit of slack above high_wmark_pages,
to be able to even out memory pressure between zones.

Maybe free up to high_wmark_pages + min_wmark_pages ?
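
As a rough comparison of the three options on the table, here is a minimal user-space sketch (not kernel code). The enum and function names are hypothetical. The example values come from the DMA32 zone reported at the start of this thread with THP enabled (min 11178, high 16767 pages).

```c
#include <assert.h>

enum gap_policy {
	GAP_8X_HIGH,		/* behaviour before the patch */
	GAP_NONE,		/* the patch: high only */
	GAP_HIGH_PLUS_MIN,	/* the slack suggested above */
};

/* Pages of free memory up to which kswapd would keep reclaiming. */
static unsigned long kswapd_target_pages(enum gap_policy policy,
					 unsigned long min_wmark,
					 unsigned long high_wmark)
{
	switch (policy) {
	case GAP_8X_HIGH:
		return 8 * high_wmark;
	case GAP_NONE:
		return high_wmark;
	case GAP_HIGH_PLUS_MIN:
		return high_wmark + min_wmark;
	}
	assert(0 && "unknown policy");
	return 0;
}
```

For that zone the targets are 134136 pages (about 524M with 4K pages), 16767 pages (about 65M) and 27945 pages (about 109M) respectively.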

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-03 14:36                   ` Rik van Riel
@ 2011-02-03 19:11                     ` Andrea Arcangeli
  2011-02-12  1:28                       ` Simon Kirby
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-03 19:11 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Thu, Feb 03, 2011 at 09:36:47AM -0500, Rik van Riel wrote:
> On 02/02/2011 09:58 PM, Andrea Arcangeli wrote:
> 
> > Comments welcome,
> > Thanks!
> > Andrea
> >
> >> ====
> >> Subject: vmscan: kswapd must not free more than high_wmark pages
> 
> NAK
> 
> I believe we need a little bit of slack above high_wmark_pages,
> to be able to even out memory pressure between zones.
> 
> Maybe free up to high_wmark_pages + min_wmark_pages ?

If this can only go in with high+min that's still better than *8, but
in a prev email on this thread I explained why I think it's not
beneficial for lru balancing, and this level can't affect kswapd wakeup
times either, so I personally prefer just "high". I don't think out of
memory has anything to do with this: the "min" level is all about the
PF_MEMALLOC and OOM levels. The zone balancing has nothing to do with
this either, and the only "hard" thing that guarantees balancing is
the lowmem reserve ratio (high ptes allocated in lowmem zones aren't
relocatable etc..).


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-03 19:11                     ` Andrea Arcangeli
@ 2011-02-12  1:28                       ` Simon Kirby
  0 siblings, 0 replies; 52+ messages in thread
From: Simon Kirby @ 2011-02-12  1:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Mel Gorman, Shaohua Li, Andrew Morton, linux-mm,
	Chen, Tim C

On Thu, Feb 03, 2011 at 08:11:57PM +0100, Andrea Arcangeli wrote:

> On Thu, Feb 03, 2011 at 09:36:47AM -0500, Rik van Riel wrote:
> > On 02/02/2011 09:58 PM, Andrea Arcangeli wrote:
> > 
> > >> Subject: vmscan: kswapd must not free more than high_wmark pages
> > 
> > NAK
> > 
> > I believe we need a little bit of slack above high_wmark_pages,
> > to be able to even out memory pressure between zones.
> > 
> > Maybe free up to high_wmark_pages + min_wmark_pages ?
> 
> If this can only go in with high+min that's still better than *8, but
> in prev email on this thread I explained why I think it's not
> beneficial for lru balancing and this level can't affect kswapd wakeup
> times either, so I personally prefer just "high". I don't think out of
> memory has anything to do with this the "min" level is all about the
> PF_MEMALLOC and OOM levels. The zone balancing as well has nothing to
> do with this and the only "hard" thing that guarantees balancing is
> the lowmem reserve ratio (high ptes allocated in lowmem zones aren't
> relocatable etc..).

I was proposing before that the allocator fast path should use a weighted
(by zone size) round robin approach to the available zones, rather than
allocating from top down, so that reclaim would be fair rather than small
zones reclaiming stuff earlier than larger zones.

Riel pointed out that this 8*high_wmark_pages thing helped free a
proportional amount of stuff from the zone once the high_wmark was
breached, eventually causing allocation rates for each zone to end up
being close to the actual size of the zone. This happens because the
watermark values are set based on the size of the zone.

I still think this approach is a bit odd, since when kswapd first wakes
up, systems with multiple zones will reclaim things that aren't as old as
the stuff in the highest zone, until the system runs for a while and this
watermark thing balances the allocation rates. OTOH, changing the
allocator increases the possibility of some high-order DMA zone
allocation failing during boot that otherwise wouldn't.
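
To show what a size-weighted round robin over zones could look like, here is a minimal user-space sketch. It is not the kernel allocator and not a proposed patch. It uses the "smooth weighted round robin" scheme as a stand-in, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct zone_rr {
	long weight;	/* zone size in arbitrary units, > 0 */
	long credit;	/* running credit, starts at 0 */
};

/*
 * Smooth weighted round robin: every call adds each zone's weight to
 * its credit, picks the zone with the most credit (lowest index wins
 * ties), and charges the total weight to the winner. Starting from
 * zero credit, a full cycle of total-weight calls picks each zone
 * exactly weight times.
 */
static size_t pick_zone(struct zone_rr *zones, size_t nr_zones)
{
	long total = 0;
	size_t best = 0;
	size_t i;

	assert(nr_zones > 0);
	for (i = 0; i < nr_zones; i++) {
		zones[i].credit += zones[i].weight;
		total += zones[i].weight;
		if (zones[i].credit > zones[best].credit)
			best = i;
	}
	zones[best].credit -= total;
	return best;
}
```

With weights 1, 3 and 4, eight consecutive picks land on the three zones 1, 3 and 4 times, interleaved rather than in bursts. The sketch ignores the lowmem reserve, so it does nothing about the boot-time DMA allocation concern.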

Simon-


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-03  2:58                 ` Andrea Arcangeli
  2011-02-03 13:15                   ` Mel Gorman
  2011-02-03 14:36                   ` Rik van Riel
@ 2011-02-14  2:25                   ` Shaohua Li
  2011-02-22 14:25                     ` Mel Gorman
  2 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-02-14  2:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Thu, Feb 03, 2011 at 10:58:08AM +0800, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> > totally untested... I will test....
> 
> The below patch is fixing my problem and working fine for me... as
> expected it can't possibly lead to any D state, it's pretty much like
> setting min_free_kbytes lower, and it's not going to alter anything
> other than the levels of free memory kept by kswapd.
> 
> $ while :; do ps xa|grep [k]swapd; sleep 1; done
>   452 ?        R      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
> $ vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  1   1784 111040 2393336 807924    0    0    63   992   56   70  1   1 96  2
>  0  1   1784 108928 2402556 801864    0    0 122624     0 1619 2150  0   5 80 16
>  0  1   1784 110664 2401244 801140    0    0 122496     0 1602 2081  0   3 81 16
>  0  1   1784 109796 2410184 792984    0    0 122752     0 1685 2149  0   4 80 16
>  0  1   1784 110416 2411856 791208    0    0 120448     4 1599 2075  0   4 81 16
>  1  0   1784 113516 2415344 785336    0    0 122496     0 1636 2125  0   4 81 15
> 
> I doubt we'll get any regression because of the below (see also my
> prev email in this thread), and I would only expect more cache and
> maybe better lru. Previously the free memory levels were stuck at
> ~700M now they're stuck at the right level for a 4G system with THP on
> (I'd still like to try to reduce the requirements only 1 hugepage for
> each migratetype in the set_min_free_kbytes to reduce the requirements
> to the minium, but only if possible..). But this saves 600M over 4G so
> it's the highest prio to address.
Sorry for the late response, I was offline for several weeks.
The patch addresses the 8*high_wmark issue, which isn't the original issue
I reported (sure, the 8*wmark issue should be fixed too).
min_free_kbytes is set higher and causes more pages to be freed even without
the 8*wmark issue. wmark:
before: min      1424
after:	min      11178
In our test, there is about 50M memory free (originally just about 5M), which
will cause more swap. Should we also reduce the min_free_kbytes?
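
For reference, the watermark values reported for this zone fit a simple relation to min. Below is a minimal user-space sketch of it (not kernel code). The relation is inferred from the zoneinfo numbers in this thread, the names are hypothetical, and 4K pages are assumed.

```c
#include <assert.h>

struct wmarks {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/*
 * Derive low and high from min the way the reported numbers suggest:
 * low = min + min/4, high = min + min/2 (integer division).
 */
static struct wmarks wmarks_from_min(unsigned long min_pages)
{
	struct wmarks w;

	w.min = min_pages;
	w.low = min_pages + min_pages / 4;
	w.high = min_pages + min_pages / 2;
	return w;
}

/* Convert a page count to kilobytes, assuming 4K pages. */
static unsigned long pages_to_kb(unsigned long pages)
{
	assert(pages <= (unsigned long)-1 / 4);	/* guard against overflow */
	return pages * 4;
}
```

A min of 1424 pages is 5696K and a min of 11178 pages is 44712K for this one zone, in line with the roughly 5M and 50M of free memory seen in the test.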

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-14  2:25                   ` Shaohua Li
@ 2011-02-22 14:25                     ` Mel Gorman
  2011-02-22 14:42                       ` Andrea Arcangeli
  2011-02-23  5:29                       ` Shaohua Li
  0 siblings, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2011-02-22 14:25 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C,
	Rik van Riel, alex.shi

On Mon, Feb 14, 2011 at 10:25:24AM +0800, Shaohua Li wrote:
> On Thu, Feb 03, 2011 at 10:58:08AM +0800, Andrea Arcangeli wrote:
> > On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> > > totally untested... I will test....
> > 
> > The below patch is fixing my problem and working fine for me... as
> > expected it can't possibly lead to any D state, it's pretty much like
> > setting min_free_kbytes lower, and it's not going to alter anything
> > other than the levels of free memory kept by kswapd.
> > 
> > $ while :; do ps xa|grep [k]swapd; sleep 1; done
> >   452 ?        R      1:20 [kswapd0]
> >   452 ?        S      1:20 [kswapd0]
> >   452 ?        S      1:20 [kswapd0]
> >   452 ?        S      1:20 [kswapd0]
> >   452 ?        S      1:20 [kswapd0]
> >   452 ?        R      1:20 [kswapd0]
> >   452 ?        R      1:20 [kswapd0]
> >   452 ?        R      1:20 [kswapd0]
> >   452 ?        R      1:20 [kswapd0]
> >   452 ?        S      1:20 [kswapd0]
> >   452 ?        R      1:20 [kswapd0]
> > $ vmstat 1
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >  2  1   1784 111040 2393336 807924    0    0    63   992   56   70  1   1 96  2
> >  0  1   1784 108928 2402556 801864    0    0 122624     0 1619 2150  0   5 80 16
> >  0  1   1784 110664 2401244 801140    0    0 122496     0 1602 2081  0   3 81 16
> >  0  1   1784 109796 2410184 792984    0    0 122752     0 1685 2149  0   4 80 16
> >  0  1   1784 110416 2411856 791208    0    0 120448     4 1599 2075  0   4 81 16
> >  1  0   1784 113516 2415344 785336    0    0 122496     0 1636 2125  0   4 81 15
> > 
> > I doubt we'll get any regression because of the below (see also my
> > prev email in this thread), and I would only expect more cache and
> > maybe better lru. Previously the free memory levels were stuck at
> > ~700M now they're stuck at the right level for a 4G system with THP on
> > (I'd still like to try to reduce the requirements only 1 hugepage for
> > each migratetype in the set_min_free_kbytes to reduce the requirements
> > to the minium, but only if possible..). But this saves 600M over 4G so
> > it's the highest prio to address.
> Sorry for the later response, I offlined several weeks.
> The patch is addressing the 8*high_wmark issue, which isn't the original issue
> I reported (sure the 8*wmark issue should be fixed too).
> min_free_kbytes is set higher and cause more pages freed even no the 8*wmark
> issue. wmark:
> before: min      1424
> after:	min      11178

The higher min_free_kbytes is expected as a result of using transparent
hugepages so I don't really consider it a bug. Free memory going up to
about 700M as a result of kswapd is a real bug though.

> in our test, there is about 50M memory free (originally just about 5M, which
> will cause more swap. Should we also reduce the min_free_kbytes?
> 

Either that or boot with transparent hugepages disabled and
min_free_kbytes will be lower.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-22 14:25                     ` Mel Gorman
@ 2011-02-22 14:42                       ` Andrea Arcangeli
  2011-02-22 14:50                         ` Mel Gorman
  2011-02-22 16:04                         ` Mel Gorman
  2011-02-23  5:29                       ` Shaohua Li
  1 sibling, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 14:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 02:25:59PM +0000, Mel Gorman wrote:
> The higher min_free_kbytes is expected as a result of using transparent
> hugepages so I don't really consider it a bug. Free memory going up to

That's true. THP can definitely increase the memory footprint of
certain apps. Especially if the app is allocating lots of data but
only touching a few bytes scattered over the mapping, the memory
footprint can increase up to 512-fold (absolute worst case of course,
on average it will be less). This is why there's the enabled=madvise
option after all.

> about 700M as a result of kswapd is a real bug though.

Yes.

> > in our test, there is about 50M memory free (originally just about 5M, which
> > will cause more swap. Should we also reduce the min_free_kbytes?
> > 
> 
> Either that or boot with transparent hugepages disabled and
> min_free_kbytes will be lower.

I suggest booting with transparent_hugepage=madvise, or setting the
default to madvise in make menuconfig. That will still enable the
anti-frag logic in the buddy allocator in full. If the problem goes
away with the madvise setting, then it's not related to
min_free_kbytes. With the 700M fix for kswapd, however, it's hard to
imagine the increased min_free_kbytes causing out of memory conditions,
even if it uses a little more memory to allow for increased
performance thanks to hugepages.

Another thing we can change (in addition to the 700M-waste fix in
kswapd) is this:

	/*
	 * By default disable transparent hugepages on smaller systems,
	 * where the extra memory used could hurt more than TLB overhead
	 * is likely to save.  The admin can still enable it through /sys.
	 */
	if (totalram_pages < (512 << (20 - PAGE_SHIFT)))
		transparent_hugepage_flags = 0;

and:

	/* don't ever allow to reserve more than 5% of the lowmem */
	recommended_min = min(recommended_min,
			      (unsigned long) nr_free_buffer_pages() / 20);

We can reduce the max min_free_kbytes to less than 5% of the lowmem,
and we can also decide not to enable THP if there's less than 2G
instead of "less than 512M".
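
To show what moving the cutoff means in pages, here is a minimal user-space sketch of the threshold test quoted above (not kernel code). It assumes 4K pages (PAGE_SHIFT of 12), and the helper names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumes 4K pages */

/* Pages in `mb` megabytes, mirroring the (mb << (20 - PAGE_SHIFT)) idiom. */
static unsigned long mb_to_pages(unsigned long mb)
{
	return mb << (20 - PAGE_SHIFT);
}

/*
 * Hypothetical predicate: would THP be disabled by default on a
 * system with `totalram_pages` pages, given a cutoff in megabytes?
 */
static int thp_off_by_default(unsigned long totalram_pages,
			      unsigned long cutoff_mb)
{
	assert(cutoff_mb > 0);
	return totalram_pages < mb_to_pages(cutoff_mb);
}
```

With the current 512M cutoff (131072 pages) a 1G system keeps THP on by default. With a 2G cutoff the same system would have it off unless the admin enables it through /sys.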

I'm also intrigued by reducing this from 2 to 1:

    /* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
    recommended_min = pageblock_nr_pages * nr_zones * 2;

Do we really need 2 pages instead of just 1 here to provide the
guarantee? I thought 1 page would be enough. But you know anti-frag
logic better ;). It won't save a lot of memory but just a couple of
mbytes, I doubt it can make any real difference. Still I prefer 1 if
it's enough.
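
To put a number on "a couple of mbytes", here is a minimal user-space sketch (not kernel code) that models only the two statements quoted above: the per-zone pageblock reserve and the 5% lowmem cap. The real function may add further terms that are not modeled here. The function name is hypothetical, and the pageblock size of 512 pages and the count of 3 zones used in the example are assumptions.

```c
#include <assert.h>

/*
 * Reserve `per_zone` pageblocks in every zone, then cap the result
 * at 5% of lowmem. All quantities are in pages.
 */
static unsigned long recommended_min_pages(unsigned long pageblock_nr_pages,
					   unsigned long nr_zones,
					   unsigned long per_zone,
					   unsigned long lowmem_pages)
{
	unsigned long recommended = pageblock_nr_pages * nr_zones * per_zone;
	unsigned long cap = lowmem_pages / 20;

	assert(per_zone >= 1);
	return recommended < cap ? recommended : cap;
}
```

With 512-page pageblocks and 3 zones, going from 2 to 1 changes this term from 3072 pages to 1536 pages, a saving of 6M with 4K pages.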


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-22 14:42                       ` Andrea Arcangeli
@ 2011-02-22 14:50                         ` Mel Gorman
  2011-02-22 14:54                           ` Andrea Arcangeli
  2011-02-22 16:04                         ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-22 14:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 03:42:00PM +0100, Andrea Arcangeli wrote:
> <SNIP>
> 
> I'm also intrigued by reducing this from 2 to 1:
> 
>     /* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
>     recommended_min = pageblock_nr_pages * nr_zones * 2;
> 
> Do we really need 2 pages instead of just 1 here to provide the
> guarantee?

For workloads that cause a lot of fragmentation - yes. Simplistically with 1,
the trace event mm_page_alloc_extfrag will trigger more frequently and
it's more likely to be severe. The problem is that if it's not "* 2",
there is a very low probability that there will be pages free in a suitable
pageblock and "mixing" occurs. It can take a very long time for
allocation success rates to go down but it happens eventually.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-22 14:50                         ` Mel Gorman
@ 2011-02-22 14:54                           ` Andrea Arcangeli
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 14:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 02:50:31PM +0000, Mel Gorman wrote:
> For workloads that cause a lot of fragmentation - yes. Simplistically with 1,
> the trace event mm_page_alloc_extfrag will trigger more frequently and
> it's more likely to be severe. The problem is that if it's not "* 2",
> there is a very low probability that there will be pages free in a suitable
> pageblock and "mixing" occurs. It can take a very long time for
> allocation success rates to go down but it happens eventually.

Ok I see. Thanks for the clarification.

So I think the other two spots I quoted in my previous email are the only
two bits we can adjust if booting with madvise doesn't fix it completely
(in addition to the *8 removal in kswapd, but that only affects ~4G
systems; those are very common, however, and this is an old bug that
just got better exposed with a higher min_free_kbytes default).


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-22 14:42                       ` Andrea Arcangeli
  2011-02-22 14:50                         ` Mel Gorman
@ 2011-02-22 16:04                         ` Mel Gorman
  2011-02-22 16:40                           ` Rik van Riel
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-22 16:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 03:42:00PM +0100, Andrea Arcangeli wrote:
> I suggest booting with transparent_hugepage=madvise, or setting the
> default to madvise in make menuconfig. That will still enable the
> anti-frag logic in the buddy allocator in full. If the problem goes
> away with the madvise setting, then it's not related to
> min_free_kbytes. With the 700M fix for kswapd, however, it's hard to
> imagine the increased min_free_kbytes causing out of memory conditions,
> even if it uses a little more memory to allow for increased
> performance thanks to hugepages.
> 

We didn't really agree on a fix though, did we? At least, I don't see a
patch we all agreed on in the thread. I stuck my ack on your patch but Rik
nak'd it because he wanted the balance gap to be preserved. We had sort of
agreed on a balance gap but didn't post a patch that implemented it. AFAIK,
an implementation of what was discussed is below. If this is not the agreed
fix, what is? If we agree on it, can Shaohua confirm the fix works?

This is against 2.6.38-rc6 which still isn't fixed and I don't see a
candidate fix in mmotm either.

==== CUT HERE ====
mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones

When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on all
zones if necessary and applies equal pressure on the inactive zone unless
a lot of pages are free already.

A "lot of free pages" is defined as a "balance gap" above the high watermark
which is currently 7*high_watermark. Historically this was reasonable as
min_free_kbytes was small. However, on systems using huge pages, it is
recommended that min_free_kbytes is higher and it is tuned with hugeadm
--set-recommended-min_free_kbytes. With the introduction of transparent
huge page support, this recommended value is also applied. On X86-64 with
4G of memory, min_free_kbytes becomes 67584 so one would expect around 68M
of memory to be free. The Normal zone is approximately 35000 pages so under
even normal memory pressure such as copying a large file, it gets exhausted
quickly. As it is getting exhausted, kswapd applies pressure equally to all
zones, including the DMA32 zone. DMA32 is approximately 700,000 pages with
a high watermark of around 23,000 pages. In this situation, kswapd will
reclaim around (23000*8 where 8 is the high watermark + balance gap of 7 *
high watermark) pages or 718M of pages before the zone is ignored. What
the user sees is that free memory is far higher than it should be.

To avoid an excessive number of pages being reclaimed from the larger zones,
explicitly define the "balance gap" to be either 1% of the zone or the
low watermark for the zone, whichever is smaller.  While kswapd will check
all zones to apply pressure, it'll ignore zones that meet the (high_wmark +
balance_gap) watermark.

To test this, 80G were copied from a partition and the amount of memory
being used was recorded. A comparison of a patched and unpatched kernel
can be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/swap.h |    9 +++++++++
 mm/vmscan.c          |   16 +++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..a57c6e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -155,6 +155,15 @@ enum {
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
+/*
+ * Ratio between the present memory in the zone and the "gap" that
+ * we're allowing kswapd to shrink in addition to the per-zone high
+ * wmark, even for zones that already have the high wmark satisfied,
+ * in order to provide better per-zone lru behavior. We are ok to
+ * spend not more than 1% of the memory for this zone balancing "gap".
+ */
+#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
+
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17497d0..0c83530 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2388,6 +2388,7 @@ loop_again:
 			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
+			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
 				continue;
@@ -2404,11 +2405,20 @@ loop_again:
 			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
 
 			/*
-			 * We put equal pressure on every zone, unless one
-			 * zone has way too many pages free already.
+			 * We put equal pressure on every zone, unless
+			 * one zone has way too many pages free
+			 * already. The "too many pages" is defined
+			 * as the high wmark plus a "gap" where the
+			 * gap is either the low watermark or 1%
+			 * of the zone, whichever is smaller.
 			 */
+			balance_gap = min(low_wmark_pages(zone),
+				(zone->present_pages +
+					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+				KSWAPD_ZONE_BALANCE_GAP_RATIO);
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone) + balance_gap,
+					end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
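
As a back-of-the-envelope check of the numbers in the changelog, the
old and new thresholds can be modelled in userspace as below. This is
only a sketch: the helper names are hypothetical, 4 KiB pages are
assumed, and the low watermark of 19000 pages used in the example is an
assumed value that is not given above.

```c
/* Ratio taken from the patch: the gap is at most 1% of the zone. */
#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100

static unsigned long sketch_min(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Old rule: kswapd keeps shrinking a zone until 8 * high watermark is free. */
unsigned long sketch_old_threshold(unsigned long high_wmark)
{
    return 8 * high_wmark;
}

/* New rule: high watermark plus min(low watermark, 1% of the zone). */
unsigned long sketch_new_threshold(unsigned long present_pages,
                                   unsigned long low_wmark,
                                   unsigned long high_wmark)
{
    unsigned long one_percent =
        (present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
        KSWAPD_ZONE_BALANCE_GAP_RATIO;

    return high_wmark + sketch_min(low_wmark, one_percent);
}

/* Convert 4 KiB pages to whole MiB (rounded down). */
unsigned long sketch_pages_to_mib(unsigned long pages)
{
    return pages * 4 / 1024;
}
```

For a DMA32 zone of 700,000 pages with a high watermark of 23,000 pages,
the old rule gives 184,000 pages (about 718M) and, with the assumed low
watermark, the new rule gives 30,000 pages (about 117M).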


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-22 16:04                         ` Mel Gorman
@ 2011-02-22 16:40                           ` Rik van Riel
  0 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-02-22 16:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen,
	Tim C, alex.shi

On 02/22/2011 11:04 AM, Mel Gorman wrote:

> To avoid an excessive number of pages being reclaimed from the larger zones,
> explicitly define the "balance gap" to be either 1% of the zone or the
> low watermark for the zone, whichever is smaller.  While kswapd will check
> all zones to apply pressure, it'll ignore zones that meet the (high_wmark +
> balance_gap) watermark.
>
> To test this, 80G were copied from a partition and the amount of memory
> being used was recorded. A comparison of a patched and unpatched kernel
> can be seen at
> http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
> and shows that kswapd is not reclaiming as much memory with the patch
> applied.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-22 14:25                     ` Mel Gorman
  2011-02-22 14:42                       ` Andrea Arcangeli
@ 2011-02-23  5:29                       ` Shaohua Li
  2011-02-23 14:45                         ` Andrea Arcangeli
  1 sibling, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-02-23  5:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C,
	Rik van Riel, Shi, Alex

On Tue, 2011-02-22 at 22:25 +0800, Mel Gorman wrote:
> On Mon, Feb 14, 2011 at 10:25:24AM +0800, Shaohua Li wrote:
> > On Thu, Feb 03, 2011 at 10:58:08AM +0800, Andrea Arcangeli wrote:
> > > On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> > > > totally untested... I will test....
> > > 
> > > The below patch is fixing my problem and working fine for me... as
> > > expected it can't possibly lead to any D state, it's pretty much like
> > > setting min_free_kbytes lower, and it's not going to alter anything
> > > other than the levels of free memory kept by kswapd.
> > > 
> > > $ while :; do ps xa|grep [k]swapd; sleep 1; done
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > > $ vmstat 1
> > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> > >  2  1   1784 111040 2393336 807924    0    0    63   992   56   70  1   1 96  2
> > >  0  1   1784 108928 2402556 801864    0    0 122624     0 1619 2150  0   5 80 16
> > >  0  1   1784 110664 2401244 801140    0    0 122496     0 1602 2081  0   3 81 16
> > >  0  1   1784 109796 2410184 792984    0    0 122752     0 1685 2149  0   4 80 16
> > >  0  1   1784 110416 2411856 791208    0    0 120448     4 1599 2075  0   4 81 16
> > >  1  0   1784 113516 2415344 785336    0    0 122496     0 1636 2125  0   4 81 15
> > > 
> > > I doubt we'll get any regression because of the below (see also my
> > > prev email in this thread), and I would only expect more cache and
> > > maybe better lru. Previously the free memory levels were stuck at
> > > ~700M; now they're stuck at the right level for a 4G system with THP on
> > > (I'd still like to try to reduce the requirements to only 1 hugepage for
> > > each migratetype in the set_min_free_kbytes to reduce the requirements
> > > to the minimum, but only if possible..). But this saves 600M over 4G so
> > > it's the highest prio to address.
> > Sorry for the late response, I was offline for several weeks.
> > The patch addresses the 8*high_wmark issue, which isn't the original issue
> > I reported (sure, the 8*wmark issue should be fixed too).
> > min_free_kbytes is set higher and causes more pages to be freed even
> > without the 8*wmark issue. wmark:
> > before: min      1424
> > after:	min      11178
> 
> The higher min_free_kbytes is expected as a result of using transparent
> hugepages so I don't really consider it a bug. Free memory going up to
> about 700M as a result of kswapd is a real bug though.
> 
> > in our test, there is about 50M memory free (originally just about 5M), which
> > will cause more swap. Should we also reduce the min_free_kbytes?
> > 
> 
> Either that or boot with transparent hugepages disabled and
> min_free_kbytes will be lower.
Fixing it will let more people enable THP by default. But anyway, we will
disable it for now if the issue can't be fixed.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-23  5:29                       ` Shaohua Li
@ 2011-02-23 14:45                         ` Andrea Arcangeli
  2011-02-24  8:08                           ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-23 14:45 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel,
	Shi, Alex

On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> Fixing it will let more people enable THP by default. but anyway we will
> disable it now if the issue can't be fixed.

Did you try what happens with transparent_hugepage=madvise? If that
doesn't fix it, it's a min_free_kbytes issue.

Also, if you're using a heavily threaded application, decreasing the
stack size with pthread_attr_setstack to something like 16k will fix
it.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-23 14:45                         ` Andrea Arcangeli
@ 2011-02-24  8:08                           ` Shaohua Li
  2011-02-24  9:52                             ` Mel Gorman
  2011-02-24 14:04                             ` Andrea Arcangeli
  0 siblings, 2 replies; 52+ messages in thread
From: Shaohua Li @ 2011-02-24  8:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel,
	Shi, Alex, Andi Kleen

On Wed, 2011-02-23 at 22:45 +0800, Andrea Arcangeli wrote:
> On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> > Fixing it will let more people enable THP by default. but anyway we will
> > disable it now if the issue can't be fixed.
> 
> Did you try what happens with transparent_hugepage=madvise? If that
> doesn't fix it, it's min_free_kbytes issue.
With madvise, the min_free_kbytes is still high (same as the 'always'
case). The result is that we still have about 50M of memory reserved. You
can try it on your machine with boot option 'mem=2G' and check the zoneinfo
output.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-24  8:08                           ` Shaohua Li
@ 2011-02-24  9:52                             ` Mel Gorman
  2011-02-24  9:57                               ` Mel Gorman
  2011-02-24 14:04                             ` Andrea Arcangeli
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-24  9:52 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C,
	Rik van Riel, Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> On Wed, 2011-02-23 at 22:45 +0800, Andrea Arcangeli wrote:
> > On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> > > Fixing it will let more people enable THP by default. but anyway we will
> > > disable it now if the issue can't be fixed.
> > 
> > Did you try what happens with transparent_hugepage=madvise? If that
> > doesn't fix it, it's min_free_kbytes issue.
> with madvise, the min_free_kbytes is still high (same as the 'always'
> case).

This high min_free_kbytes is expected and is not considered a bug as it's
related to transparent hugepages being able to allocate huge pages for a
long period of time. Essentially, it's a cost of using hugepages.

> The result is still we have about 50M memory is reserved. you can
> try at your machine with boot option 'mem=2G' and check the zoneinfo
> output.
> 

Is the actual free memory around the 50M mark or is it far higher than
it should be?
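
To put the zoneinfo page counts into perspective, here is a small
sketch that derives the low and high watermarks from min and converts
pages to kB. The min + min/4 and min + min/2 relation is an assumption
about how kernels of this era derive the watermarks, but it does match
the DMA32 values posted at the start of the thread. The helper names
are hypothetical and 4 KiB pages are assumed.

```c
/* Assumed relation: low = min + min/4. */
unsigned long sketch_low_wmark(unsigned long min_pages)
{
    return min_pages + (min_pages >> 2);
}

/* Assumed relation: high = min + min/2. */
unsigned long sketch_high_wmark(unsigned long min_pages)
{
    return min_pages + (min_pages >> 1);
}

/* Convert 4 KiB pages to kilobytes. */
unsigned long sketch_pages_to_kb(unsigned long pages)
{
    return pages * 4;
}
```

For the reported min of 11178 pages this gives low 13972 and high 16767
pages, so the zone's high watermark corresponds to 67068 kB, or roughly
65M.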

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-24  9:52                             ` Mel Gorman
@ 2011-02-24  9:57                               ` Mel Gorman
  2011-02-24 14:27                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-24  9:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C,
	Rik van Riel, Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 09:52:09AM +0000, Mel Gorman wrote:
> On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> > On Wed, 2011-02-23 at 22:45 +0800, Andrea Arcangeli wrote:
> > > On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> > > > Fixing it will let more people enable THP by default. but anyway we will
> > > > disable it now if the issue can't be fixed.
> > > 
> > > Did you try what happens with transparent_hugepage=madvise? If that
> > > doesn't fix it, it's min_free_kbytes issue.
> > with madvise, the min_free_kbytes is still high (same as the 'always'
> > case).
> 
> This high min_free_kbytes is expected and is not considered a bug as it's
> related to transparent hugepages being able to allocate huge pages for a
> long period of time. Essentially, it's a cost of using hugepages.
> 

I should be clearer here. madvise|always sets a high min_free_kbytes by
this check

        if (ret > 0 &&
            (test_bit(TRANSPARENT_HUGEPAGE_FLAG,
                      &transparent_hugepage_flags) ||
             test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
                      &transparent_hugepage_flags)))
                set_recommended_min_free_kbytes();

so I'd expect the new higher value for min_free_kbytes once THP was ever
expected to be used.

If this new value was still considered a bug, removing the call to
set_recommended_min_free_kbytes() would always use the lower value that
was used in older kernels. This would "fix" the bug but transparent hugepage
users would not get the pages they expected the longer the system was running.
This would be harder for ordinary users to catch.
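
The effect of that check can be modelled in userspace as below. This is
a sketch only: the bit positions and the sketch_* names are placeholders
and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Placeholder bit numbers, chosen for this sketch only. */
enum sketch_thp_bit {
    SKETCH_THP_ALWAYS_BIT  = 0, /* stands in for TRANSPARENT_HUGEPAGE_FLAG */
    SKETCH_THP_MADVISE_BIT = 1, /* stands in for TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG */
};

/* Userspace stand-in for test_bit() on a single word. */
static bool sketch_test_bit(int nr, unsigned long flags)
{
    return (flags >> nr) & 1UL;
}

/*
 * Mirrors the quoted condition: the higher min_free_kbytes is requested
 * when the write succeeded (ret > 0) and THP is "always" or "madvise".
 */
bool sketch_wants_recommended_min(int ret, unsigned long flags)
{
    return ret > 0 &&
           (sketch_test_bit(SKETCH_THP_ALWAYS_BIT, flags) ||
            sketch_test_bit(SKETCH_THP_MADVISE_BIT, flags));
}
```

With neither bit set (the "never" case) the function returns false, so
the lower default would be left alone.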

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-24  9:57                               ` Mel Gorman
@ 2011-02-24 14:27                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-24 14:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel,
	Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 09:57:27AM +0000, Mel Gorman wrote:
> I should be clearer here. madvise|always sets a high min_free_kbytes by
> this check
> 
>         if (ret > 0 &&
>             (test_bit(TRANSPARENT_HUGEPAGE_FLAG,
>                       &transparent_hugepage_flags) ||
>              test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>                       &transparent_hugepage_flags)))
>                 set_recommended_min_free_kbytes();
> 
> so I'd expect the new higher value for min_free_kbytes once THP was ever
> expected to be used.
> 
> If this new value was still considered a bug, removing the call to
> set_recommended_min_free_kbytes() would always use the lower value that
> was used in older kernels. This would "fix" the bug but transparent hugepage
> users would not get the pages they expected the longer the system was running.
> This would be harder for ordinary users to catch.

This is a safe default for TRANSPARENT_HUGEPAGE_FLAG. All servers
will want set_recommended_min_free_kbytes. All we can argue about is
whether the TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG setting needs this or not
(maybe we can remove the TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG check
considering madvise is mostly for embedded systems that can't waste a
byte in case THP increases the memory footprint of the program but
they still want to use THP for embedded virt or similar usages that
don't waste any memory at peak load).


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-24  8:08                           ` Shaohua Li
  2011-02-24  9:52                             ` Mel Gorman
@ 2011-02-24 14:04                             ` Andrea Arcangeli
  2011-02-25  0:51                               ` Shaohua Li
  1 sibling, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-24 14:04 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel,
	Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> with madvise, the min_free_kbytes is still high (same as the 'always'
> case). The result is still we have about 50M memory is reserved. you can
> try at your machine with boot option 'mem=2G' and check the zoneinfo
> output.

Yes, I know. The objective of that test was exactly to know whether the
problem is the higher memory footprint because of THP or only the
anti-frag/min_free_kbytes, which would still be present with the
"madvise" setting (anti-frag is only shut down by the "never"
setting). If you still run out of memory with madvise, then you
can keep THP enabled "always" and then "echo 16384 >
/proc/sys/vm/min_free_kbytes". It should then work fine even in THP
always mode, with no need to disable THP (you simply won't have a good
guarantee that anti-frag is functional, so the hugepage usage will be
reduced over time compared to the default min_free_kbytes that enables
anti-frag fully).
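
For a test harness that wants to do the same thing as the echo above
from C, a minimal sketch could look like this. The function names are
hypothetical, the path is passed in so the helpers can be tried on an
ordinary file first, and writing the real sysctl file needs root.

```c
#include <stdio.h>

/* Read one decimal value from a sysctl-style file; returns -1 on failure. */
long sketch_read_kbytes(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Write a value the way "echo N > path" would; 0 on success, -1 on failure. */
int sketch_write_kbytes(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Reading /proc/sys/vm/min_free_kbytes before and after the write makes it
easy to log both values alongside the benchmark results.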


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-24 14:04                             ` Andrea Arcangeli
@ 2011-02-25  0:51                               ` Shaohua Li
  2011-02-25 12:13                                 ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-02-25  0:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel,
	Shi, Alex, Andi Kleen

On Thu, 2011-02-24 at 22:04 +0800, Andrea Arcangeli wrote:
> On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> > with madvise, the min_free_kbytes is still high (same as the 'always'
> > case). The result is still we have about 50M memory is reserved. you can
> > try at your machine with boot option 'mem=2G' and check the zoneinfo
> > output.
> 
> yes I know. The objective of that test was exactly to know if the
> problem is higher memory footprint because of THP or only the
> anti-frag/min_free_kbytes which would still be present with the
> "madvise" setting (anti-frag is only shutdown by the "never"
> setting). If you still have the out of memory with madvise, then you
> can keep THP enabled "always" and then "echo 16384 >
> /proc/sys/vm/min_free_kbytes", it should work fine then even with THP
> always mode then, no need to disable THP (simply you won't have a good
> guarantee that anti-frag is functional so the hugepage usage will be
> reduced over time compared to the default min_free_kbytes that enables
> anti-frag fully).
I can disable THP or set the min_free_kbytes manually in our test, but
I just wonder if it's possible to avoid the memory waste even with THP
enabled, because this will make more people enable it by default. If you
don't consider this a problem, we can disable THP.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-02-25  0:51                               ` Shaohua Li
@ 2011-02-25 12:13                                 ` Mel Gorman
  0 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2011-02-25 12:13 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C,
	Rik van Riel, Shi, Alex, Andi Kleen

On Fri, Feb 25, 2011 at 08:51:49AM +0800, Shaohua Li wrote:
> On Thu, 2011-02-24 at 22:04 +0800, Andrea Arcangeli wrote:
> > On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> > > with madvise, the min_free_kbytes is still high (same as the 'always'
> > > case). The result is still we have about 50M memory is reserved. you can
> > > try at your machine with boot option 'mem=2G' and check the zoneinfo
> > > output.
> > 
> > yes I know. The objective of that test was exactly to know if the
> > problem is higher memory footprint because of THP or only the
> > anti-frag/min_free_kbytes which would still be present with the
> > "madvise" setting (anti-frag is only shutdown by the "never"
> > setting). If you still have the out of memory with madvise, then you
> > can keep THP enabled "always" and then "echo 16384 >
> > /proc/sys/vm/min_free_kbytes", it should work fine then even with THP
> > always mode then, no need to disable THP (simply you won't have a good
> > guarantee that anti-frag is functional so the hugepage usage will be
> > reduced over time compared to the default min_free_kbytes that enables
> > anti-frag fully).
>
> I can disable THP or set the min_free_kbytes manually in our test, but
> just wonder if it's possible we can avoid the memory waste even with THP
> enabled, because this will make more people enable it by default.

With a lower value of min_free_kbytes, THP would give diminishing returns
as hugepage allocation success rates start degrading over time. It
might not happen for several days or weeks making it a tricky problem to
diagnose. So yes, the memory waste with THP enabled can be fixed but it
would only be suitable for short-term benchmarks.

> If you
> don't consider this is a problem, we can disable THP.
> 

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: too big min_free_kbytes
  2011-01-27 15:27               ` Andrea Arcangeli
  2011-01-27 16:03                 ` Mel Gorman
  2011-02-03  2:58                 ` Andrea Arcangeli
@ 2011-02-12  9:48                 ` alex shi
  2011-02-22 14:24                   ` Mel Gorman
  2 siblings, 1 reply; 52+ messages in thread
From: alex shi @ 2011-02-12  9:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, alex.shi

I tried the patch, but it seems to have no effect on our regression.

Regards
Alex

On Thu, Jan 27, 2011 at 11:27 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Thu, Jan 27, 2011 at 01:40:58PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> > > On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > > > But the wmarks don't
> > > > > seem the real offender, maybe it's something related to the tiny pci32
> > > > > zone that materialize on 4g systems that relocate some little memory
> > > > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > > > it.
> > > > >
> > > >
> > > > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > > > the "Normal" zone is never hitting the watermark and we constantly call
> > > > "goto loop_again" trying to "rebalance" all zones.
> > > >
> > >
> > > Confirmed.
> > > <SNIP>
> >
> > How about the following? Functionally it would work but I am concerned
> > that the logic in balance_pgdat() and kswapd() is getting out of hand
> > having been adjusted to work with a number of corner cases already. In
> > the next cycle, it could do with a "do-over" attempt to make it easier
> > to follow.
>
> That number 8 is the problem, I don't think anybody was ever supposed
> to free 8*highwmark pages. kswapd must work in the hysteresis range
> low->high area and then sleep, waiting for low to be hit again before it
> gets woken up. Not sure how that number 8 ever came up... but to me it
> looks like the real offender and I wouldn't work around it.
>
> totally untested... I will test....
>
> ====
> Subject: vmscan: kswapd must not free more than high_wmark pages
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> When the min_free_kbytes is set with "hugeadm
> --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> anti-frag at full effectiveness automatically at boot) the high wmark
> of some zone is as high as ~88M. 88M free on a 4G system isn't
> horrible, but 88M*8 = 704M free on a 4G system is definitely
> unbearable. This only tends to be visible on 4G systems with a tiny
> over-4g zone, where kswapd insists on reaching the high wmark on the
> over-4g zone but in doing so shrinks up to 704M from the normal zone by
> mistake.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5d90de..9e3c78e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2407,7 +2407,7 @@ loop_again:
>                         * zone has way too many pages free already.
>                         */
>                         if (!zone_watermark_ok_safe(zone, order,
> -                                       8*high_wmark_pages(zone), end_zone, 0))
> +                                       high_wmark_pages(zone), end_zone, 0))
>                                 shrink_zone(priority, zone, &sc);
>                        reclaim_state->reclaimed_slab = 0;
>                        nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
>
>
>


* Re: too big min_free_kbytes
  2011-02-12  9:48                 ` alex shi
@ 2011-02-22 14:24                   ` Mel Gorman
  0 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2011-02-22 14:24 UTC (permalink / raw)
  To: alex shi
  Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen,
	Tim C, alex.shi

On Sat, Feb 12, 2011 at 05:48:55PM +0800, alex shi wrote:
> I tried the patch, but it seems to have no effect on our regression.
> 

What is the nature of your regression? I see no details of it in the
thread.

-- 
Mel Gorman
SUSE Labs


Thread overview: 52+ messages
2011-01-24  3:56 too big min_free_kbytes Shaohua Li
2011-01-24 15:00 ` Andrea Arcangeli
2011-01-25 14:35   ` Mel Gorman
2011-01-26 14:17   ` Mel Gorman
2011-01-26 15:23     ` Mel Gorman
2011-01-26 15:42       ` Andrea Arcangeli
2011-01-26 16:36         ` Mel Gorman
2011-01-26 17:42           ` Mel Gorman
2011-01-27 13:40             ` Mel Gorman
2011-01-27 15:27               ` Andrea Arcangeli
2011-01-27 16:03                 ` Mel Gorman
2011-01-27 18:52                   ` Andrea Arcangeli
2011-01-27 20:33                     ` Rik van Riel
2011-01-27 21:31                     ` Mel Gorman
2011-01-27 23:18                       ` Rik van Riel
2011-01-28 10:35                         ` Mel Gorman
2011-01-28 16:28                           ` Andrea Arcangeli
2011-01-28 16:46                             ` Mel Gorman
2011-01-28 17:16                               ` Rik van Riel
2011-01-28 17:46                                 ` Andrea Arcangeli
2011-01-28 18:03                                   ` Rik van Riel
2011-01-28 18:24                                     ` Andrea Arcangeli
2011-01-28 19:34                                       ` Rik van Riel
2011-01-28 19:45                                         ` Andrea Arcangeli
2011-01-28 20:55                                           ` Rik van Riel
2011-01-29 19:45                                             ` Andrea Arcangeli
2011-01-28 17:34                               ` Andrea Arcangeli
2011-01-28 17:10                             ` Rik van Riel
2011-02-03  2:58                 ` Andrea Arcangeli
2011-02-03 13:15                   ` Mel Gorman
2011-02-03 18:59                     ` Andrea Arcangeli
2011-02-03 14:36                   ` Rik van Riel
2011-02-03 19:11                     ` Andrea Arcangeli
2011-02-12  1:28                       ` Simon Kirby
2011-02-14  2:25                   ` Shaohua Li
2011-02-22 14:25                     ` Mel Gorman
2011-02-22 14:42                       ` Andrea Arcangeli
2011-02-22 14:50                         ` Mel Gorman
2011-02-22 14:54                           ` Andrea Arcangeli
2011-02-22 16:04                         ` Mel Gorman
2011-02-22 16:40                           ` Rik van Riel
2011-02-23  5:29                       ` Shaohua Li
2011-02-23 14:45                         ` Andrea Arcangeli
2011-02-24  8:08                           ` Shaohua Li
2011-02-24  9:52                             ` Mel Gorman
2011-02-24  9:57                               ` Mel Gorman
2011-02-24 14:27                                 ` Andrea Arcangeli
2011-02-24 14:04                             ` Andrea Arcangeli
2011-02-25  0:51                               ` Shaohua Li
2011-02-25 12:13                                 ` Mel Gorman
2011-02-12  9:48                 ` alex shi
2011-02-22 14:24                   ` Mel Gorman
