* too big min_free_kbytes
@ 2011-01-24  3:56 Shaohua Li
  2011-01-24 15:00 ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-01-24  3:56 UTC (permalink / raw)
  To: Andrew Morton, aarcange; +Cc: linux-mm, Chen, Tim C

Hi,
With transparent huge page, min_free_kbytes is set too big.
Before:
Node 0, zone    DMA32
  pages free     1812
        min      1424
        low      1780
        high     2136
        scanned  0
        spanned  519168
        present  511496

After:
Node 0, zone    DMA32
  pages free     482708
        min      11178
        low      13972
        high     16767
        scanned  0
        spanned  519168
        present  511496
This caused different performance problems in our test. I wonder why we
set the value so big.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread
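The zoneinfo figures above are page counts, which makes the jump hard to judge at a glance. A minimal sketch of the conversion, assuming 4 KiB base pages (the usual x86-64 value; the report does not state the page size) and using helper names that are purely illustrative:

```c
#include <stdio.h>

/* Assumption: 4 KiB base pages, as on a typical x86-64 kernel. */
#define PAGE_SIZE_KIB 4UL

/* Convert a /proc/zoneinfo page count to KiB. */
static unsigned long pages_to_kib(unsigned long pages)
{
	return pages * PAGE_SIZE_KIB;
}

/* Convert a /proc/zoneinfo page count to whole MiB (rounded down). */
static unsigned long pages_to_mib(unsigned long pages)
{
	return pages_to_kib(pages) / 1024;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to print the comparison */
int main(void)
{
	/* Before/after values copied from the DMA32 zone in the report. */
	printf("min:  %lu -> %lu MiB\n", pages_to_mib(1424), pages_to_mib(11178));
	printf("high: %lu -> %lu MiB\n", pages_to_mib(2136), pages_to_mib(16767));
	printf("free: %lu -> %lu MiB\n", pages_to_mib(1812), pages_to_mib(482708));
	return 0;
}
#endif
```

Under that page-size assumption the DMA32 high watermark moves from about 8 MiB to about 65 MiB, while the free count moves from about 7 MiB to about 1885 MiB, so the free memory ends up far above what the watermark alone would explain.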
* Re: too big min_free_kbytes
  2011-01-24  3:56 too big min_free_kbytes Shaohua Li
@ 2011-01-24 15:00 ` Andrea Arcangeli
  2011-01-25 14:35   ` Mel Gorman
  2011-01-26 14:17   ` Mel Gorman
  0 siblings, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-24 15:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrew Morton, linux-mm, Chen, Tim C, Mel Gorman

On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> Hi,
> With transparent huge page, min_free_kbytes is set too big.
> Before:
> Node 0, zone    DMA32
>   pages free     1812
>         min      1424
>         low      1780
>         high     2136
>         scanned  0
>         spanned  519168
>         present  511496
>
> After:
> Node 0, zone    DMA32
>   pages free     482708
>         min      11178
>         low      13972
>         high     16767
>         scanned  0
>         spanned  519168
>         present  511496
> This caused different performance problems in our test. I wonder why we
> set the value so big.

It's to enable Mel's anti-frag that keeps pageblocks with movable and
unmovable stuff separated, same as "hugeadm
--set-recommended-min_free_kbytes".

Now that I checked, I'm seeing quite too much free memory with only 4G
of ram... You can see the difference with a "cp /dev/sda /dev/null" in
background interleaving these two commands:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo 1000 > /proc/sys/vm/min_free_kbytes

The setting of min_free_kbytes to 67584 leads to 716MB of memory
free. Setting to 1000 leads to 20MB free. I'm afraid losing 716MB on a
4G system is way excessive regardless of THP... can't we just have a
version of anti-frag that reserves a lot fewer pageblocks? Anti-frag
is quite important to avoid slab to fragment everything. I don't think
we can leave it like this.

For now you can work around it with the above echo 1000 > ...
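The 67584 figure quoted above can be reproduced with a little arithmetic. The sketch below is a user-space approximation of how the THP code of that era derived its recommended value, written from memory of set_recommended_min_free_kbytes() in mm/huge_memory.c rather than from anything posted in this thread. It assumes 4 KiB pages, 2 MiB pageblocks (512 pages), three per-cpu migrate types, and three populated zones (DMA, DMA32, Normal); it also leaves out the kernel's cap at 5% of lowmem. The function name is hypothetical.

```c
#include <stdio.h>

/* Assumptions for a 4G x86-64 machine; none of these are stated in the thread. */
#define PAGE_SHIFT		12	/* 4 KiB pages */
#define PAGEBLOCK_NR_PAGES	512UL	/* 2 MiB pageblock / 4 KiB page */
#define MIGRATE_PCPTYPES	3UL	/* unmovable, reclaimable, movable */

/*
 * Approximate the recommended min_free_kbytes for a given number of
 * populated zones. The kernel additionally caps the result at 5% of
 * lowmem; that cap is omitted in this sketch.
 */
static unsigned long recommended_min_free_kbytes(unsigned long nr_zones)
{
	unsigned long pages;

	/* Two pageblocks per zone for the reserve migrate type. */
	pages = PAGEBLOCK_NR_PAGES * nr_zones * 2;

	/* Pageblocks kept mostly free so one migrate type can fall back on another. */
	pages += PAGEBLOCK_NR_PAGES * nr_zones * MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

	/* Convert pages to kilobytes. */
	return pages << (PAGE_SHIFT - 10);
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to print the value */
int main(void)
{
	printf("%lu kB\n", recommended_min_free_kbytes(3));
	return 0;
}
#endif
```

With three zones this gives (3072 + 13824) pages, i.e. 16896 pages or 67584 kB, which matches the value reported on the 4G machine. The result scales with the number of populated zones, not with the amount of RAM, which is one reason the reserve looks disproportionate on small systems.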
* Re: too big min_free_kbytes
  2011-01-24 15:00 ` Andrea Arcangeli
@ 2011-01-25 14:35   ` Mel Gorman
  2011-01-26 14:17   ` Mel Gorman
  1 sibling, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2011-01-25 14:35 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

Sorry for the long delay in replying. I've been out the last week and am
not properly back until tomorrow.

On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > Hi,
> > With transparent huge page, min_free_kbytes is set too big.
> > Before:
> > Node 0, zone    DMA32
> >   pages free     1812
> >         min      1424
> >         low      1780
> >         high     2136
> >         scanned  0
> >         spanned  519168
> >         present  511496
> >
> > After:
> > Node 0, zone    DMA32
> >   pages free     482708
> >         min      11178
> >         low      13972
> >         high     16767
> >         scanned  0
> >         spanned  519168
> >         present  511496
> > This caused different performance problems in our test. I wonder why we
> > set the value so big.
>
> It's to enable Mel's anti-frag that keeps pageblocks with movable and
> unmovable stuff separated, same as "hugeadm
> --set-recommended-min_free_kbytes".
>

It's not so much "make it work" as "make it work better". The effect can
be measured by recording the mm_page_alloc_extfrag event. The more times
it occurs, the worse fragmentation can get. The event also reports
whether it is severe or not.

> Now that I checked, I'm seeing quite too much free memory with only 4G
> of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> background interleaving these two commands:
>

There is more than just min_free_kbytes happening here. The high watermark
goes to 16M-ish but the amount of free memory is *way* above that
watermark. Something is causing page reclaim to be a lot more aggressive
than it should be. Is there a difference with THP enabled and disabled
but leaving min_free_kbytes alone? My preliminary theory is that 2M
pages are being requested and kswapd is being woken up when it shouldn't
(__GFP_NO_KSWAPD not specified when it should be). Unfortunately I do not
have access to source at the moment to double check.

> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo 1000 > /proc/sys/vm/min_free_kbytes
>
> The setting of min_free_kbytes to 67584 leads to 716MB of memory
> free. Setting to 1000 leads to 20MB free. I'm afraid losing 716MB on a
> 4G system is way excessive regardless of THP...

Agreed.

> can't we just have a
> version of anti-frag that reserves a lot fewer pageblocks?

Anti-frag doesn't really take any additional special action due to
min_free_kbytes and it shouldn't be clearing out pageblocks aggressively
like this. I think it would also be worth checking how often the
mm_vmscan_kswapd_wake and mm_vmscan_wakeup_kswapd trace events are
triggering. If mm_vmscan_wakeup_kswapd is triggering a lot, a stack
trace of the most common triggering event might give a clue as to what
is going wrong.

> Anti-frag
> is quite important to avoid slab to fragment everything. I don't think
> we can leave it like this.
>
> For now you can workaround with the above echo 1000 > ...
>

Agreed. I'll try to find time to investigate before the week is out but
after being offline for a week, I've a lot of catching up to do.

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes
  2011-01-24 15:00 ` Andrea Arcangeli
  2011-01-25 14:35   ` Mel Gorman
@ 2011-01-26 14:17   ` Mel Gorman
  2011-01-26 15:23     ` Mel Gorman
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 14:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > Hi,
> > With transparent huge page, min_free_kbytes is set too big.
> > Before:
> > Node 0, zone    DMA32
> >   pages free     1812
> >         min      1424
> >         low      1780
> >         high     2136
> >         scanned  0
> >         spanned  519168
> >         present  511496
> >
> > After:
> > Node 0, zone    DMA32
> >   pages free     482708
> >         min      11178
> >         low      13972
> >         high     16767
> >         scanned  0
> >         spanned  519168
> >         present  511496
> > This caused different performance problems in our test. I wonder why we
> > set the value so big.
>
> It's to enable Mel's anti-frag that keeps pageblocks with movable and
> unmovable stuff separated, same as "hugeadm
> --set-recommended-min_free_kbytes".
>
> Now that I checked, I'm seeing quite too much free memory with only 4G
> of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> background interleaving these two commands:
>

What kernel is this and is commit
[99504748: mm: kswapd: stop high-order balancing when any suitable zone
is balanced] present in the kernel you are testing?

I'm having very little luck reproducing your scenario with
2.6.38-rc2. min_free_kbytes is as expected and the free memory is close
to expectations when copying /dev/sda to /dev/null with or without
transparent hugepages.

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes
  2011-01-26 14:17   ` Mel Gorman
@ 2011-01-26 15:23     ` Mel Gorman
  2011-01-26 15:42       ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 15:23 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 02:17:46PM +0000, Mel Gorman wrote:
> On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> > On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > > Hi,
> > > With transparent huge page, min_free_kbytes is set too big.
> > > Before:
> > > Node 0, zone    DMA32
> > >   pages free     1812
> > >         min      1424
> > >         low      1780
> > >         high     2136
> > >         scanned  0
> > >         spanned  519168
> > >         present  511496
> > >
> > > After:
> > > Node 0, zone    DMA32
> > >   pages free     482708
> > >         min      11178
> > >         low      13972
> > >         high     16767
> > >         scanned  0
> > >         spanned  519168
> > >         present  511496
> > > This caused different performance problems in our test. I wonder why we
> > > set the value so big.
> >
> > It's to enable Mel's anti-frag that keeps pageblocks with movable and
> > unmovable stuff separated, same as "hugeadm
> > --set-recommended-min_free_kbytes".
> >
> > Now that I checked, I'm seeing quite too much free memory with only 4G
> > of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> > background interleaving these two commands:
> >
>
> What kernel is this and is commit
> [99504748: mm: kswapd: stop high-order balancing when any suitable zone
> is balanced] present in the kernel you are testing?
>
> I'm having very little luck reproducing your scenario with
> 2.6.38-rc2.

Scratch that, a machine with 4G does reproduce it. The machine I was
trying was 2G. Will dig more.

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes
  2011-01-26 15:23     ` Mel Gorman
@ 2011-01-26 15:42       ` Andrea Arcangeli
  2011-01-26 16:36         ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-26 15:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 03:23:02PM +0000, Mel Gorman wrote:
> On Wed, Jan 26, 2011 at 02:17:46PM +0000, Mel Gorman wrote:
> > On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> > > On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > > > Hi,
> > > > With transparent huge page, min_free_kbytes is set too big.
> > > > Before:
> > > > Node 0, zone    DMA32
> > > >   pages free     1812
> > > >         min      1424
> > > >         low      1780
> > > >         high     2136
> > > >         scanned  0
> > > >         spanned  519168
> > > >         present  511496
> > > >
> > > > After:
> > > > Node 0, zone    DMA32
> > > >   pages free     482708
> > > >         min      11178
> > > >         low      13972
> > > >         high     16767
> > > >         scanned  0
> > > >         spanned  519168
> > > >         present  511496
> > > > This caused different performance problems in our test. I wonder why we
> > > > set the value so big.
> > >
> > > It's to enable Mel's anti-frag that keeps pageblocks with movable and
> > > unmovable stuff separated, same as "hugeadm
> > > --set-recommended-min_free_kbytes".
> > >
> > > Now that I checked, I'm seeing quite too much free memory with only 4G
> > > of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> > > background interleaving these two commands:
> > >
> >
> > What kernel is this and is commit
> > [99504748: mm: kswapd: stop high-order balancing when any suitable zone
> > is balanced] present in the kernel you are testing?
> >
> > I'm having very little luck reproducing your scenario with
> > 2.6.38-rc2.
>
> Scratch that, a machine with 4G does reproduce it. The machine I was
> trying was 2G. Will dig more.

I can't reproduce on a 16G system (there I never get more than an
hundred mbyte free even with cp in background, which is very fine for
16G).

I only reproduce on my 4G workstation, and it happens also after echo
never >enabled (so without THP). I was reproducing it with "cp" anyway
which isn't triggering THP allocations but I verified to be sure. When
I start cp kswapd wasn't running yet, so free levels go down to 170M,
then kswapd starts and it frees 700M and then 700M remains free
forever until I stop "cp". The high wmark are never set to more than
85M for the normal zone, which is not excessively horrible. I'd still
like to lower the wmark though! (there are 2 pageblocks reserved in
the min watermark for each type, why not just 1? removing that *2
would already halve it saving some 40M of ram!). But the wmarks don't
seem the real offender, maybe it's something related to the tiny pci32
zone that materialize on 4g systems that relocate some little memory
over 4g to make space for the pci32 mmio. I didn't yet finish to debug
it.

However in presence of memory pressure the low wmark is the limit not
the high wmark (and when kswapd isn't running free levels already go
down to 170M even where I can reproduce). Maybe the failure with too
much memory free may be only because of the increased wmark from some
20M to ~100M, and maybe I'm seeing something unrelated to that
problem. __GFP_NO_KSWAPD I exclude is the issue as it happens without
THP too and there's just one place where huge_memory.c allocates
memory.
* Re: too big min_free_kbytes
  2011-01-26 15:42       ` Andrea Arcangeli
@ 2011-01-26 16:36         ` Mel Gorman
  2011-01-26 17:42           ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 16:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 04:42:03PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 26, 2011 at 03:23:02PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 02:17:46PM +0000, Mel Gorman wrote:
> > > On Mon, Jan 24, 2011 at 04:00:34PM +0100, Andrea Arcangeli wrote:
> > > > On Mon, Jan 24, 2011 at 11:56:46AM +0800, Shaohua Li wrote:
> > > > > Hi,
> > > > > With transparent huge page, min_free_kbytes is set too big.
> > > > > Before:
> > > > > Node 0, zone    DMA32
> > > > >   pages free     1812
> > > > >         min      1424
> > > > >         low      1780
> > > > >         high     2136
> > > > >         scanned  0
> > > > >         spanned  519168
> > > > >         present  511496
> > > > >
> > > > > After:
> > > > > Node 0, zone    DMA32
> > > > >   pages free     482708
> > > > >         min      11178
> > > > >         low      13972
> > > > >         high     16767
> > > > >         scanned  0
> > > > >         spanned  519168
> > > > >         present  511496
> > > > > This caused different performance problems in our test. I wonder why we
> > > > > set the value so big.
> > > >
> > > > It's to enable Mel's anti-frag that keeps pageblocks with movable and
> > > > unmovable stuff separated, same as "hugeadm
> > > > --set-recommended-min_free_kbytes".
> > > >
> > > > Now that I checked, I'm seeing quite too much free memory with only 4G
> > > > of ram... You can see the difference with a "cp /dev/sda /dev/null" in
> > > > background interleaving these two commands:
> > > >
> > >
> > > What kernel is this and is commit
> > > [99504748: mm: kswapd: stop high-order balancing when any suitable zone
> > > is balanced] present in the kernel you are testing?
> > >
> > > I'm having very little luck reproducing your scenario with
> > > 2.6.38-rc2.
> >
> > Scratch that, a machine with 4G does reproduce it. The machine I was
> > trying was 2G. Will dig more.
>
> I can't reproduce on a 16G system (there I never get more than an
> hundred mbyte free even with cp in background, which is very fine for
> 16G).
>

It's a balancing problem in kswapd. From my preliminary examination
using ftrace, I determined that kswapd is never trying to go to sleep
and continually shrinking lists so it must be stuck in balance_pgdat().

> I only reproduce on my 4G workstation, and it happens also after echo
> never >enabled (so without THP). I was reproducing it with "cp" anyway
> which isn't triggering THP allocations but I verified to be sure. When
> I start cp kswapd wasn't running yet, so free levels go down to 170M,
> then kswapd starts and it frees 700M and then 700m remains free
> forever until I stop "cp".

This has nothing to do with THP. It should be possible to trigger on any
4G machine or specifically where the top zone is very small.

> The high wmark are never set to more than
> 85M for the normal zone, which is not excessively horrible. I'd still
> like to lower the wmark though! (there are 2 pageblocks reserved in
> the min watermark for each type, why not just 1? removing that *2
> would already halve it saving some 40M of ram!).

This is a separate topic, let's not get side-tracked. Short answer, it
comes down to at the time when no pageblock of the appropriate
migratetype is free, we want on average one full pageblock to be free
of another type so it can be converted. This limits the amount of
"mixing" of pages of different migratetype in the same pageblock. The
effect can be monitored using the extfrag tracepoint.

> But the wmarks don't
> seem the real offender, maybe it's something related to the tiny pci32
> zone that materialize on 4g systems that relocate some little memory
> over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> it.
>

This has to be it. What I think is happening is that we're in balance_pgdat(),
the "Normal" zone is never hitting the watermark and we constantly call
"goto loop_again" trying to "rebalance" all zones.

> However in presence of memory pressure the low wmark is the limit not
> the high wmark (and when kswapd isn't running free levels already go
> down to 170M even where I can reproduce). Maybe the failure with too
> much memory free may be only because of the increased wmark from some
> 20M to ~100M, and maybe I'm seeing something unrelated to that
> problem.

I very strongly suspect it's just because your Normal zone is never being
balanced and kswapd is never breaking out of balance_pgdat() as a
result. I hope to confirm before I get knocked back offline (my access
to test machines is currently heavily disrupted).

> __GFP_NO_KSWAPD I exclude is the issue as it happens without
> THP too and there's just one place where huge_memory.c allocates
> memory.

Agreed, it's nothing to do with __GFP_NO_KSWAPD from what I've seen so
far.

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes
  2011-01-26 16:36         ` Mel Gorman
@ 2011-01-26 17:42           ` Mel Gorman
  2011-01-27 13:40             ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-26 17:42 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > But the wmarks don't
> > seem the real offender, maybe it's something related to the tiny pci32
> > zone that materialize on 4g systems that relocate some little memory
> > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > it.
> >
>
> This has to be it. What I think is happening is that we're in balance_pgdat(),
> the "Normal" zone is never hitting the watermark and we constantly call
> "goto loop_again" trying to "rebalance" all zones.
>

Confirmed. The following "patch" should allow the number of free pages
to drop to a sensible level. Note, this is not intended as a fix because
it's the utterly wrong approach to take. It's only to illustrate where
things are going wrong when the top-most zone is very small.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..477cb77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2259,7 +2259,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		}
 
 		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							classzone_idx, 0))
+							classzone_idx, 0) &&
+		    zone->present_pages >= pgdat->node_present_pages >> 2)
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2446,15 +2447,18 @@ loop_again:
 
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
-				all_zones_ok = 0;
-				/*
-				 * We are still under min water mark.  This
-				 * means that we have a GFP_ATOMIC allocation
-				 * failure risk. Hurry up!
-				 */
-				if (!zone_watermark_ok_safe(zone, order,
-					    min_wmark_pages(zone), end_zone, 0))
-					has_under_min_watermark_zone = 1;
+				if (zone->present_pages >= pgdat->node_present_pages >> 2) {
+					all_zones_ok = 0;
+
+					/*
+					 * We are still under min water mark.  This
+					 * means that we have a GFP_ATOMIC allocation
+					 * failure risk. Hurry up!
+					 */
+					if (!zone_watermark_ok_safe(zone, order,
+						    min_wmark_pages(zone), end_zone, 0))
+						has_under_min_watermark_zone = 1;
+				}
 			} else {
 				/*
 				 * If a zone reaches its high watermark,
* Re: too big min_free_kbytes
  2011-01-26 17:42           ` Mel Gorman
@ 2011-01-27 13:40             ` Mel Gorman
  2011-01-27 15:27               ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-27 13:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > But the wmarks don't
> > > seem the real offender, maybe it's something related to the tiny pci32
> > > zone that materialize on 4g systems that relocate some little memory
> > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > it.
> > >
> >
> > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > the "Normal" zone is never hitting the watermark and we constantly call
> > "goto loop_again" trying to "rebalance" all zones.
> >
>
> Confirmed.
> <SNIP>

How about the following? Functionally it would work but I am concerned
that the logic in balance_pgdat() and kswapd() is getting out of hand
having been adjusted to work with a number of corner cases already. In
the next cycle, it could do with a "do-over" attempt to make it easier
to follow.

==== CUT HERE ====
mm: kswapd: Do not reclaim excessive pages from already balanced zones

When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on all
zones if necessary and applies equal pressure on the inactive zone unless
a lot of pages are free already.

A "lot of free pages" is defined as 8*high_watermark which historically has
been reasonably fine as min_free_kbytes was small. However, on systems using
huge pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes. With the introduction of
transparent huge page support, this recommended value is also applied.

The problem then is in the corner cases. On X86-64 with 4G of memory,
min_free_kbytes becomes 67584 so one would expect around 68M of memory to
be free. The Normal zone is approximately 35000 pages so under even normal
memory pressure such as copying a large file, it gets exhausted quickly. As
it is getting exhausted, kswapd applies pressure equally to all zones,
including the DMA32 zone. DMA32 is approximately 700,000 pages with a high
watermark of around 23,000 pages. In this situation, kswapd will reclaim
around (23000*8) pages or 718M of pages before the zone is ignored. What
the user sees is kswapd constantly stuck in D state and free memory far
higher than it should be.

This patch addresses the problem by taking into account if kswapd is
looping in balance_pgdat() when deciding if a zone is balanced or not. If
the zone is relatively small and kswapd is looping or preparing to sleep,
then the zone is considered balanced. If an allocator has hit the low
watermark, kswapd will stay awake (pgdat->kswapd_max_order or classzone_idx)
will be set and reread or will get woken later when real memory pressure
exists.

Using a very basic test of cp /dev/sda6 /dev/null where sda6 was an 80G
partition, the amount of free memory without this patch hovered around the
700M mark and around the 90M mark when applied which is closer to
expectations for the larger default min_free_kbytes with THP enabled.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   44 ++++++++++++++++++++++++++++++++++++++------
 1 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..3d4ffd3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2228,6 +2228,35 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages > (present_pages >> 2);
 }
 
+static bool zone_balanced(struct zone *zone, int order, unsigned long mark,
+				int classzone_idx, bool firstscan)
+{
+	pg_data_t *pgdat = zone->zone_pgdat;
+
+	/*
+	 * If this is a relatively small zone and kswapd is looping
+	 * for order-0 pages, consider the zone to be balanced so
+	 * kswapd has a chance to go back to sleep. Direct reclaimers
+	 * will wake kswapd again if necessary. Otherwise there is a
+	 * risk that kswapd will reclaim an excessive number of pages
+	 * from larger zones even when allocators do not require it
+	 * due to balance_pgdat reclaiming pages from each zone unless
+	 * free pages > 8*high_watermark which is potentially a large
+	 * number of pages.
+	 *
+	 * Small is considered to be node_present_pages >> 2 due to
+	 * the "free pages > 8*high_watermark" heuristic. The
+	 * smallest possible low zone (DMA) and a small high zone
+	 * should in combination be related to the maximum amount
+	 * of memory kswapd will reclaim from the other zones.
+	 */
+	if (!firstscan && order == 0 &&
+			zone->present_pages < pgdat->node_present_pages >> 2)
+		return true;
+
+	return zone_watermark_ok_safe(zone, order, mark, classzone_idx, 0);
+}
+
 /* is kswapd sleeping prematurely? */
 static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
@@ -2258,8 +2287,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 			continue;
 		}
 
-		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							classzone_idx, 0))
+		if (!zone_balanced(zone, order, high_wmark_pages(zone),
+							classzone_idx, false))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2306,6 +2335,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long total_scanned;
+	bool firstscan;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -2444,16 +2474,16 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (!zone_watermark_ok_safe(zone, order,
-					high_wmark_pages(zone), end_zone, 0)) {
+			if (!zone_balanced(zone, order,
+					high_wmark_pages(zone), end_zone, firstscan)) {
 				all_zones_ok = 0;
 				/*
 				 * We are still under min water mark.  This
 				 * means that we have a GFP_ATOMIC allocation
 				 * failure risk. Hurry up!
 				 */
-				if (!zone_watermark_ok_safe(zone, order,
-					    min_wmark_pages(zone), end_zone, 0))
+				if (!zone_balanced(zone, order,
+					    min_wmark_pages(zone), end_zone, firstscan))
 					has_under_min_watermark_zone = 1;
 			} else {
 				/*
@@ -2520,6 +2550,8 @@ out:
 		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
 			order = sc.order = 0;
 
+		firstscan = false;
+
 		goto loop_again;
 	}
 
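The decision made by zone_balanced() in the patch above can be modelled in user space to see which zones it exempts. This is only a sketch: `struct toy_zone`, `toy_watermark_ok()` and `toy_zone_balanced()` are hypothetical stand-ins, the watermark test ignores lowmem_reserve and high-order checks, and the node size of 735,000 pages is an assumption obtained by adding the approximate Normal and DMA32 sizes from the changelog and ignoring the DMA zone. The hunks above only show `firstscan` being set to false, so the sketch takes it as an explicit parameter.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the fields of struct zone / pg_data_t used by the check. */
struct toy_zone {
	unsigned long present_pages;		/* pages in this zone */
	unsigned long free_pages;		/* currently free pages */
	unsigned long node_present_pages;	/* pages in the whole node */
};

/* Simplified watermark test: ignores lowmem_reserve and order > 0 checks. */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned long mark)
{
	return z->free_pages > mark;
}

/*
 * Mirror of the patch's logic: once kswapd is past its first scan and is
 * reclaiming for order-0, a zone smaller than a quarter of the node is
 * treated as balanced regardless of its watermark.
 */
static bool toy_zone_balanced(const struct toy_zone *z, int order,
			      unsigned long mark, bool firstscan)
{
	if (!firstscan && order == 0 &&
	    z->present_pages < (z->node_present_pages >> 2))
		return true;

	return toy_watermark_ok(z, mark);
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to print the decisions */
int main(void)
{
	struct toy_zone normal = { 35000, 1000, 735000 };
	struct toy_zone dma32 = { 700000, 1000, 735000 };

	printf("Normal, first scan: %d\n", toy_zone_balanced(&normal, 0, 5000, true));
	printf("Normal, later scan: %d\n", toy_zone_balanced(&normal, 0, 5000, false));
	printf("DMA32,  later scan: %d\n", toy_zone_balanced(&dma32, 0, 23000, false));
	return 0;
}
#endif
```

With these assumed sizes the quarter-of-node threshold is 183,750 pages, so the small Normal zone is exempted on later scans while DMA32 still has to meet its watermark.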
* Re: too big min_free_kbytes
  2011-01-27 13:40             ` Mel Gorman
@ 2011-01-27 15:27               ` Andrea Arcangeli
  2011-01-27 16:03                 ` Mel Gorman
                                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-27 15:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 01:40:58PM +0000, Mel Gorman wrote:
> On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > > But the wmarks don't
> > > > seem the real offender, maybe it's something related to the tiny pci32
> > > > zone that materialize on 4g systems that relocate some little memory
> > > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > > it.
> > > >
> > >
> > > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > > the "Normal" zone is never hitting the watermark and we constantly call
> > > "goto loop_again" trying to "rebalance" all zones.
> > >
> >
> > Confirmed.
> > <SNIP>
>
> How about the following? Functionally it would work but I am concerned
> that the logic in balance_pgdat() and kswapd() is getting out of hand
> having being adjusted to work with a number of corner cases already. In
> the next cycle, it could do with a "do-over" attempt to make it easier
> to follow.

That number 8 is the problem, I don't think anybody was ever supposed
to free 8*highwmark pages. kswapd must work in the hysteresis range
low->high area and then sleep and wait for low to be hit again before
it gets woken up. Not sure how that number 8 ever came up... but to me
it looks like the real offender and I wouldn't work around it.

totally untested... I will test....

====
Subject: vmscan: kswapd must not free more than high_wmark pages

From: Andrea Arcangeli <aarcange@redhat.com>

When the min_free_kbytes is set with "hugeadm
--set-recommended-min_free_kbytes" or with THP enabled (which runs the
equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
anti-frag at full effectiveness automatically at boot) the high wmark
of some zone is as high as ~88M. 88M free on a 4G system isn't
horrible, but 88M*8 = 704M free on a 4G system is definitely
unbearable. This only tends to be visible on 4G systems with tiny
over-4g zone where kswapd insists to reach the high wmark on the
over-4g zone but doing so it shrunk up to 704M from the normal zone by
mistake.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..9e3c78e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2407,7 +2407,7 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
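The effect of the multiplier in the one-line change above can be checked with simple arithmetic. The sketch below assumes 4 KiB pages and uses the approximate DMA32 high watermark of 23,000 pages quoted earlier in the thread; `shrink_ceiling_pages()` is a hypothetical helper that only models the "stop shrinking once free pages exceed multiplier * high watermark" test, not the real zone_watermark_ok_safe().

```c
#include <stdio.h>

/* Assumption: 4 KiB base pages. */
#define PAGE_SIZE_KIB 4UL

/*
 * Free-page level (in pages) up to which balance_pgdat() keeps calling
 * shrink_zone() on a zone, for a given multiplier of the high watermark.
 */
static unsigned long shrink_ceiling_pages(unsigned long high_wmark_pages,
					  unsigned long multiplier)
{
	return high_wmark_pages * multiplier;
}

/* Convert pages to whole MiB, rounded down. */
static unsigned long pages_to_mib(unsigned long pages)
{
	return pages * PAGE_SIZE_KIB / 1024;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to print both ceilings */
int main(void)
{
	unsigned long high = 23000;	/* approximate DMA32 high watermark */

	printf("8x high wmark: %lu MiB\n", pages_to_mib(shrink_ceiling_pages(high, 8)));
	printf("1x high wmark: %lu MiB\n", pages_to_mib(shrink_ceiling_pages(high, 1)));
	return 0;
}
#endif
```

With the 8x multiplier the ceiling works out to about 718 MiB, in line with the roughly 700M of free memory observed; with the multiplier removed it drops to about 89 MiB, close to the high watermark itself.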
* Re: too big min_free_kbytes 2011-01-27 15:27 ` Andrea Arcangeli @ 2011-01-27 16:03 ` Mel Gorman 2011-01-27 18:52 ` Andrea Arcangeli 2011-02-03 2:58 ` Andrea Arcangeli 2011-02-12 9:48 ` alex shi 2 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-27 16:03 UTC
To: Andrea Arcangeli
Cc: Shaohua Li, Andrew Morton, Rik van Riel, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 01:40:58PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> > > On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > > > But the wmarks don't
> > > > > seem the real offender, maybe it's something related to the tiny pci32
> > > > > zone that materialize on 4g systems that relocate some little memory
> > > > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > > > it.
> > > > >
> > > >
> > > > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > > > the "Normal" zone is never hitting the watermark and we constantly call
> > > > "goto loop_again" trying to "rebalance" all zones.
> > > >
> > >
> > > Confirmed.
> > > <SNIP>
> >
> > How about the following? Functionally it would work but I am concerned
> > that the logic in balance_pgdat() and kswapd() is getting out of hand
> > having being adjusted to work with a number of corner cases already. In
> > the next cycle, it could do with a "do-over" attempt to make it easier
> > to follow.
>
> That number 8 is the problem,

Agreed, I considered your approach as well. I didn't go with it because
the factor of 8 is the main heuristic that allows kswapd to skip a zone
while still letting kswapd keep going. I made the choice to try to put
kswapd to sleep sooner.

> I don't think anybody was ever supposed
> to free 8*highwmark pages. kswapd must work in the low->high
> hysteresis range and then sleep, waiting for the low wmark to be hit
> again before it gets woken up. Not sure how that number 8 ever came
> up... but to me it looks like the real offender and I wouldn't work
> around it.
>

It was introduced by commit [32a4330d: mm: prevent kswapd from freeing
excessive amounts of lowmem] and sure enough, it was intended to avoid a
situation where memory was freed from every zone if one was imbalanced -
sounds familiar.

> Totally untested... I will test...
>

It should work in terms of free memory. When testing, also monitor
whether kswapd is going to sleep or whether it is stuck in D state. If
it's stuck in D state, it's looping around in balance_pgdat() and
consuming CPU for no good reason (the vmscan tracepoints can be used to
confirm).

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
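One way to do the D-state check from userspace is to read the state field of /proc/<pid>/stat for the kswapd thread. The following is only a sketch: it assumes the usual "pid (comm) state ..." layout of that file, proc_stat_state() is a hypothetical helper, and the sample lines in the note below are invented rather than captured from a real system.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the one-character task state from a /proc/<pid>/stat line,
 * or '?' if the line does not look like "pid (comm) state ...".
 * The comm field may itself contain ')', so search from the right.
 */
static char proc_stat_state(const char *line)
{
	const char *close = strrchr(line, ')');

	if (close == NULL || close[1] != ' ' || close[2] == '\0')
		return '?';
	return close[2];
}
```

For a line such as "28 (kswapd0) D 2 0 0 0" the helper returns 'D' (uninterruptible sleep), and for "28 (kswapd0) S 2 0 0 0" it returns 'S'. Sampling this periodically during the cp test would show whether kswapd ever goes back to sleep.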
* Re: too big min_free_kbytes 2011-01-27 16:03 ` Mel Gorman @ 2011-01-27 18:52 ` Andrea Arcangeli 2011-01-27 20:33 ` Rik van Riel 2011-01-27 21:31 ` Mel Gorman 0 siblings, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-27 18:52 UTC
To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, Rik van Riel, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 04:03:01PM +0000, Mel Gorman wrote:
> Agreed, I considered your approach as well. I didn't go with it because it
> was the main heuristic that allowed kswapd to skip a zone but still allows
> kswapd to keep going. I made the choice to try and put kswapd to sleep
> sooner.

Ok, but a multiplication *8 remains excessive and while it may be ok
with min_free_kbytes=20M it's not ok when it's = 80M, especially when
it can be set to 80M on a 4G system that will end up with a small
over-4g zone that may not be shrunk as easily as the normal/pci32 zone
below 4g.

It's broken because this *8 is all about adding a 7*highwmark "gap".

I'm having a little trouble understanding your patch and I don't like
the magic >> 2 very much, if the node has little more than 1/4th of
the memory of the node, it'll still cause the other zones to be shrunk
8 times more than they should ever be shrunk! This will materialize
with ~mem=5g, with your patch a little more than 5g will still lead
to ~800M free by mistake. It seems more a band aid for the 4g case
than a real fix. This is why I think the real fix is to remove that *8
and create a real "balance gap ratio" that is a function of the
memory of the zone, not a function of the high wmark at all.

If we were using the old code the gap would be way smaller. The "gap"
is increasing excessively because the "high wmark" is increasing to a
fixed value that is a function of the number of pageblocks, the
migrate types etc..., but from an algorithm point of view the high
wmark has no effect on the rotation of all lrus to balance the
shrinking of all zones. The high wmark is a fixed amount for all
zones, the "gap" doesn't need to increase with the high wmark.

Clearly the high wmark was used because in the old days it was a
function of the ram size, now it's not anymore. So clearly the "gap"
must not be a function of the high wmark anymore but only a function
of the memory size! Which I think is the real fix.

> It was introduced by commit [32a4330d: mm: prevent kswapd from freeing
> excessive amounts of lowmem] and sure enough, it was intended to avoid a
> situation where memory was freed from every zone if one was imbalanced -
> sounds familiar.

Yes definitely. So it was limiting the waste to 8*high_wmark. But that
was ok because it had the assumption that the wmark was a function of
memory, it's not ok anymore and we must make it a function of memory
explicitly to fix this.

> It should work in terms of free memory. When testing, monitor as well if
> kswapd is going asleep or if it is stuck in D state. If it's stuck in D state,
> it's looping around in balance_pgdat() and consuming CPU for no good reason
> (can use vmscan tracepoints to confirm).

I'll try another patch first, one that avoids disabling the balancing
of all zones and so should provide nicer lru behavior than my previous
patch.

I am however uncertain this is really better than removing the *8 as
in my previous patch. But either this or the previous patch I sent is
the solution I prefer, because this fixes it without a magic >>2 that
will break again quite badly at little more than mem=5g.

====
Subject: vmscan: kswapd must not free more than high_wmark+gap pages

From: Andrea Arcangeli <aarcange@redhat.com>

When the min_free_kbytes is set with "hugeadm
--set-recommended-min_free_kbytes" or with THP enabled (which runs the
equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
anti-frag at full effectiveness automatically at boot) the high wmark
of some zone is fixed as high as ~88M, no longer a function of
memory size. 88M free on a 4G system isn't horrible, but 88M*8 = 704M
free on a 4G system is unbearable. This only tends to be visible on 4G
systems with a tiny over-4g zone, where kswapd insists on reaching the
high wmark on the over-4g zone but in doing so shrinks up to 704M from
the normal zone by mistake. This patch makes the "gap" an explicit
function of memory size, because the high wmark isn't a function of
memory size anymore.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..a57c6e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -155,6 +155,15 @@ enum {
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
+/*
+ * Ratio between the present memory in the zone and the "gap" that
+ * we're allowing kswapd to shrink in addition to the per-zone high
+ * wmark, even for zones that already have the high wmark satisfied,
+ * in order to provide better per-zone lru behavior. We are ok to
+ * spend not more than 1% of the memory for this zone balancing "gap".
+ */
+#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
+
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5d90de..f03441e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2403,11 +2403,16 @@ loop_again:
 			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
 
 			/*
-			 * We put equal pressure on every zone, unless one
-			 * zone has way too many pages free already.
+			 * We put equal pressure on every zone, unless
+			 * one zone has way too many pages free
+			 * already. The "too many pages" is defined
+			 * as the high wmark plus a "gap".
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					(zone->present_pages +
+						KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+					KSWAPD_ZONE_BALANCE_GAP_RATIO +
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
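The threshold expression in the patch above is a round-up division plus the high wmark. A small userspace sketch of the same arithmetic follows; the helper names balance_gap() and new_skip_threshold() are made up here, and only the constant comes from the patch.

```c
#include <stdint.h>

/* Mirrors the constant the patch adds to include/linux/swap.h. */
#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100

/* 1% of the zone's present pages, rounded up, as in the patch. */
static uint64_t balance_gap(uint64_t present_pages)
{
	return (present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
		KSWAPD_ZONE_BALANCE_GAP_RATIO;
}

/* Free-page level above which kswapd would skip shrinking the zone. */
static uint64_t new_skip_threshold(uint64_t present_pages,
				   uint64_t high_wmark)
{
	return balance_gap(present_pages) + high_wmark;
}
```

For the DMA32 zone in the original report (511496 present pages, high wmark 16767) this gives a gap of 5115 pages and a threshold of 21882 pages, compared with 8 * 16767 = 134136 pages under the old check.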
* Re: too big min_free_kbytes 2011-01-27 18:52 ` Andrea Arcangeli @ 2011-01-27 20:33 ` Rik van Riel 2011-01-27 21:31 ` Mel Gorman 1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2011-01-27 20:33 UTC
To: Andrea Arcangeli
Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/27/2011 01:52 PM, Andrea Arcangeli wrote:

> if (!zone_watermark_ok_safe(zone, order,
> - 8*high_wmark_pages(zone), end_zone, 0))
> + (zone->present_pages +
> + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> + KSWAPD_ZONE_BALANCE_GAP_RATIO +
> + high_wmark_pages(zone), end_zone, 0))
> shrink_zone(priority, zone,&sc);

Isn't (zone->present_pages + 99) / 100 + high_wmark_pages(zone)
pretty much guaranteed to be significantly larger than the 8 times
the high watermark we had before?

--
All rights reversed
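Whether the 1% form is larger than 8*high depends on the ratio between zone size and high wmark, which a quick userspace check makes visible. This is a sketch only; new_threshold_is_larger() is an invented name and the large-zone figures in the note are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare the proposed threshold (1% of the zone, rounded up, plus the
 * high wmark) with the old one (8 * high wmark). All values in pages.
 */
static bool new_threshold_is_larger(uint64_t present_pages,
				    uint64_t high_wmark)
{
	uint64_t new_thr = (present_pages + 99) / 100 + high_wmark;
	uint64_t old_thr = 8 * high_wmark;

	return new_thr > old_thr;
}
```

With the DMA32 figures from the original report the new threshold is smaller in both cases: 7251 against 17088 pages with the default watermarks (high 2136), and 21882 against 134136 pages with THP enabled (high 16767). It only becomes larger when the zone is big relative to its high wmark, for example a hypothetical 4000000-page zone with a high wmark of 2136.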
* Re: too big min_free_kbytes 2011-01-27 18:52 ` Andrea Arcangeli 2011-01-27 20:33 ` Rik van Riel @ 2011-01-27 21:31 ` Mel Gorman 2011-01-27 23:18 ` Rik van Riel 1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-27 21:31 UTC
To: Andrea Arcangeli
Cc: Shaohua Li, Andrew Morton, Rik van Riel, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 07:52:15PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 04:03:01PM +0000, Mel Gorman wrote:
> > Agreed, I considered your approach as well. I didn't go with it because it
> > was the main heuristic that allowed kswapd to skip a zone but still allows
> > kswapd to keep going. I made the choice to try and put kswapd to sleep
> > sooner.
>
> Ok, but a multiplication *8 remains excessive and while it may be ok
> with min_free_kbytes=20M it's not ok when it's = 80M, especially when
> it can be set to 80M on a 4G system that will end up with a small
> over-4g zone that may not be shrunk as easily as the normal/pci32 zone
> below 4g.
>

Agreed on this front at least.

> It's broken because this *8 is all about adding a 7*highwmark "gap".
>

It's not so much the gap being a multiple as how much of a gap that
works out as being.

> I'm having a little trouble understanding your patch and I don't like
> the magic >> 2 very much, if the node has little more than 1/4th of

if the zone has little more than 1/4th I assume you mean.

> the memory of the node, it'll still cause the other zones to be shrunk
> 8 times more than they should ever be shrunk! This will materialize
> with ~mem=5g, with your patch a little more than 5g will still lead
> to ~800M free by mistake.

You're right that 5G would lead to the Normal zone being slightly above
the quarter mark. Initially I considered that a 1G zone would remain
balanced for long enough for kswapd to go to sleep but now that I
consider it more it's not safe. It might work on one machine and fail
on a faster one, making it hard to pin down.

> It seems more a band aid for the 4g case
> than a real fix. This is why I think the real fix is to remove that *8
> and create a real "balance gap ratio" that is a function of the
> memory of the zone, not a function of the high wmark at all.
>
> If we were using the old code the gap would be way smaller. The "gap"
> is increasing excessively because the "high wmark" is increasing to a
> fixed value that is a function of the number of pageblocks, the
> migrate types etc..., but from an algorithm point of view the high
> wmark has no effect on the rotation of all lrus to balance the
> shrinking of all zones. The high wmark is a fixed amount for all
> zones, the "gap" doesn't need to increase with the high wmark.
>

Ok, that would be a mild improvement but what value should that gap be?
If it's a plain percentage of the zone, it could still become an
extremely large value. Conceivably it would be better to rely on an
event from the page allocator. Specifically, if the allocator has not
complained recently that this node is under pressure, as indicated by
calls to wakeup_kswapd(), then stop reclaiming from any zone that meets
the watermark.

> Clearly the high wmark was used because in the old days it was a
> function of the ram size, now it's not anymore. So clearly the "gap"
> must not be a function of the high wmark anymore but only a function
> of the memory size! Which I think is the real fix.
>
> > It was introduced by commit [32a4330d: mm: prevent kswapd from freeing
> > excessive amounts of lowmem] and sure enough, it was intended to avoid a
> > situation where memory was freed from every zone if one was imbalanced -
> > sounds familiar.
>
> Yes definitely. So it was limiting the waste to 8*high_wmark. But that
> was ok because it had the assumption that the wmark was a function of
> memory, it's not ok anymore and we must make it a function of memory
> explicitly to fix this.
>

hmm, admittedly a gap that was a function of memory would limit the
damage but it doesn't prevent a situation where a really small Normal
zone can prevent kswapd going to sleep. i.e. when I get to testing your
patch (hopefully tomorrow, Tuesday at worst), I'll be looking for kswapd
being stuck in D state.

> > It should work in terms of free memory. When testing, monitor as well if
> > kswapd is going asleep or if it is stuck in D state. If it's stuck in D state,
> > it's looping around in balance_pgdat() and consuming CPU for no good reason
> > (can use vmscan tracepoints to confirm).
>
> I'll try another patch first, one that avoids disabling the balancing
> of all zones and so should provide nicer lru behavior than my previous
> patch.
>
> I am however uncertain this is really better than removing the *8 as
> in my previous patch. But either this or the previous patch I sent is
> the solution I prefer, because this fixes it without a magic >>2 that
> will break again quite badly at little more than mem=5g.
>

Whatever the final solution, it needs both to prevent too much memory
being reclaimed and to allow kswapd to go to sleep if there is no
indication from the page allocator that it should stay awake.

> ====
> Subject: vmscan: kswapd must not free more than high_wmark+gap pages
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> When the min_free_kbytes is set with "hugeadm
> --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> anti-frag at full effectiveness automatically at boot) the high wmark
> of some zone is fixed as high as ~88M, no longer a function of
> memory size. 88M free on a 4G system isn't horrible, but 88M*8 = 704M
> free on a 4G system is unbearable. This only tends to be visible on 4G

At the very least, we agree on what is causing this problem :)

> systems with a tiny over-4g zone, where kswapd insists on reaching the
> high wmark on the over-4g zone but in doing so shrinks up to 704M from
> the normal zone by mistake. This patch makes the "gap" an explicit
> function of memory size, because the high wmark isn't a function of
> memory size anymore.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4d55932..a57c6e7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -155,6 +155,15 @@ enum {
> #define SWAP_CLUSTER_MAX 32
> #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>
> +/*
> + * Ratio between the present memory in the zone and the "gap" that
> + * we're allowing kswapd to shrink in addition to the per-zone high
> + * wmark, even for zones that already have the high wmark satisfied,
> + * in order to provide better per-zone lru behavior. We are ok to
> + * spend not more than 1% of the memory for this zone balancing "gap".
> + */
> +#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
> +
> #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
> #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
> #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5d90de..f03441e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2403,11 +2403,16 @@ loop_again:
> mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
>
> /*
> - * We put equal pressure on every zone, unless one
> - * zone has way too many pages free already.
> + * We put equal pressure on every zone, unless
> + * one zone has way too many pages free
> + * already. The "too many pages" is defined
> + * as the high wmark plus a "gap".
> */
> if (!zone_watermark_ok_safe(zone, order,
> - 8*high_wmark_pages(zone), end_zone, 0))
> + (zone->present_pages +
> + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> + KSWAPD_ZONE_BALANCE_GAP_RATIO +
> + high_wmark_pages(zone), end_zone, 0))

Rik has already pointed out that this is potentially a very large gap
but that is an addressable problem if the final decision goes in this
direction.

> shrink_zone(priority, zone, &sc);
> reclaim_state->reclaimed_slab = 0;
> nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
>

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes 2011-01-27 21:31 ` Mel Gorman @ 2011-01-27 23:18 ` Rik van Riel 2011-01-28 10:35 ` Mel Gorman 0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2011-01-27 23:18 UTC
To: Mel Gorman
Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 01/27/2011 04:31 PM, Mel Gorman wrote:

> Whatever the final solution, it both needs to prevent too much memory
> being reclaimed and allow kswapd to go to sleep if there is no
> indication from the page allocator that it should stay awake.

A third requirement:

If one zone has a lot lower memory pressure than another zone,
we want to do relatively more memory allocations from that zone,
than from a zone where the memory is heavily used.

If kswapd only ever goes up to the high watermark, and also uses
that as its sleep point, the allocations end up corresponding to
zone size alone and not to memory pressure.

Going a little bit above the high watermark (1% of zone size?
high + min watermark?) will help balance things out between zones.

>> if (!zone_watermark_ok_safe(zone, order,
>> - 8*high_wmark_pages(zone), end_zone, 0))
>> + (zone->present_pages +
>> + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>> + KSWAPD_ZONE_BALANCE_GAP_RATIO +
>> + high_wmark_pages(zone), end_zone, 0))
>
> Rik has already pointed out that this potentially is a very large gap
> but that is an addressable problem if the final decision goes this
> direction.

I was wrong. I guess on some systems the min watermark can be less
than 1% and (high + min) may be better, but on most systems the
number of pages should be about the same.

Maybe we should use high_wmark_pages(zone) + low_wmark_pages(zone)
for easy readability?

--
All rights reversed
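The gap candidates mentioned in this message (1% of the zone, the min wmark, the low wmark) can be put side by side in a small userspace sketch. The struct, enum and function names are invented for illustration and nothing here is kernel API.

```c
#include <stdint.h>

/* Size and watermarks of one zone, all in pages. */
struct zone_marks {
	uint64_t present;
	uint64_t min;
	uint64_t low;
	uint64_t high;
};

/* The gap candidates floated in the thread (names are illustrative). */
enum gap_kind { GAP_ONE_PERCENT, GAP_MIN_WMARK, GAP_LOW_WMARK };

/* Free-page level above which kswapd would skip shrinking the zone. */
static uint64_t skip_threshold(const struct zone_marks *z, enum gap_kind kind)
{
	uint64_t gap = 0;

	switch (kind) {
	case GAP_ONE_PERCENT:
		gap = (z->present + 99) / 100;	/* round up, as in the patch */
		break;
	case GAP_MIN_WMARK:
		gap = z->min;
		break;
	case GAP_LOW_WMARK:
		gap = z->low;
		break;
	}
	return z->high + gap;
}
```

With the DMA32 numbers reported with THP enabled (present 511496, min 11178, low 13972, high 16767) the three variants give 21882, 27945 and 30739 pages respectively, all far below the 134136 pages of the 8*high check.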
* Re: too big min_free_kbytes 2011-01-27 23:18 ` Rik van Riel @ 2011-01-28 10:35 ` Mel Gorman 2011-01-28 16:28 ` Andrea Arcangeli 0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-01-28 10:35 UTC
To: Rik van Riel
Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Thu, Jan 27, 2011 at 06:18:07PM -0500, Rik van Riel wrote:
> On 01/27/2011 04:31 PM, Mel Gorman wrote:
>
>> Whatever the final solution, it both needs to prevent too much memory
>> being reclaimed and allow kswapd to go to sleep if there is no
>> indication from the page allocator that it should stay awake.
>
> A third requirement:
>
> If one zone has a lot lower memory pressure than another zone,
> we want to do relatively more memory allocations from that zone,
> than from a zone where the memory is heavily used.
>

Risky. Allocations could end up using a lower zone than required,
causing a form of lowmem pressure when highmem should have been used.
Worse, it'll be unnoticeable on x86-64 but potentially cause problems
on x86-32 that are easily missed.

> If kswapd only ever goes up to the high watermark, and also uses
> that as its sleep point, the allocations end up corresponding to
> zone size alone and not to memory pressure.
>

hmm.

> Going a little bit above the high watermark (1% of zone size?
> high + min watermark?) will help balance things out between zones.
>
>>> if (!zone_watermark_ok_safe(zone, order,
>>> - 8*high_wmark_pages(zone), end_zone, 0))
>>> + (zone->present_pages +
>>> + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>>> + KSWAPD_ZONE_BALANCE_GAP_RATIO +
>>> + high_wmark_pages(zone), end_zone, 0))
>>
>> Rik has already pointed out that this potentially is a very large gap
>> but that is an addressable problem if the final decision goes this
>> direction.
>
> I was wrong. I guess on some systems the min watermark can be less
> than 1% and (high + min) may be better, but on most systems the
> number of pages should be about the same.
>
> Maybe we should use high_wmark_pages(zone) + low_wmark_pages(zone)
> for easy readability?
>

I'd be ok with high+low as a starting point to solve the immediate
problem of way too much memory being free and then treat "kswapd must go
to sleep" as a separate problem. I'm less keen on 1% but only because it
could be too large a value.

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes 2011-01-28 10:35 ` Mel Gorman @ 2011-01-28 16:28 ` Andrea Arcangeli 2011-01-28 16:46 ` Mel Gorman 2011-01-28 17:10 ` Rik van Riel 0 siblings, 2 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-28 16:28 UTC
To: Mel Gorman; +Cc: Rik van Riel, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 10:35:39AM +0000, Mel Gorman wrote:
> I'd be ok with high+low as a starting point to solve the immediate
> problem of way too much memory being free and then treat "kswapd must go
> to sleep" as a separate problem. I'm less keen on 1% but only because it
> could be too large a value.

min(1%, low) sounds best to me. Because on the 4G system "low" is likely
bigger than 1%.

But really to me it sounds best to apply my first patch and stick to
the high watermark and remove the gap.

What is going on is that the dma and pci32 zones are at high+gap. The
over-4g zone is at "high". kswapd keeps running until all are above
high. But as long as there's at least one not over high, the others
are shrunk up to high+gap.

The allocator is taught that it should always try to allocate from the
over-4g zone. And the over-4g zone is never below the "low" wmark
because 100% of the cache is clean so kswapd keeps the normal and dma
zones at high+gap and the over-4g zone at "high".

In a previous email you asked me how kswapd gets stuck in D state and
never stops working, and said that it should stop earlier. This sounds
impossible, kswapd behavior can't possibly change, simply there is
less memory freed by lowering that "gap". Also you can make the gap as
big as you want but it'll only make a difference the first time, then
kswapd will stop shrinking the normal and dma zones when they reach
high+gap. Regardless of the gap size. So kswapd can't possibly change
behavior and it can't possibly be in D state by just changing this
"gap" size. Which is why I think the gap should be zero and I'd like
my first patch to be applied. There's no point in wasting ram for a
feature that can't guarantee we rotate the zone allocation.

The balancing problem can't be solved in kswapd. It can only be solved
in the allocator if you really aim to give more rotation to the
lrus. As long as the over-4g zone is allocated from first, at some
point the lrus in the normal/dma zone will have to stop
rotating. Either that or kswapd will shrink 100% of the ram in the
dma/normal zone, which would destroy all the cache, which is clearly
wrong.

And if you change the allocator to allocate in rotation from the 3
zones (clearly we would never want to allocate from the dma zone, so
it's a magic area here) there is absolutely no need for any "gap" in
kswapd to keep the shrinking balanced.

In short I think tackling the zone balancing problem in kswapd is wrong
and kswapd should stick to the high wmark only, and if you care about
zone balancing it should be done in the allocator only, then kswapd
will cope with whatever the allocator decides just fine.

I guess the LRU caching behavior of a 4g system with a little memory
over 4g is going to be worse than if you boot with mem=4g and there's
nothing kswapd can do about it as long as the allocator always grabs
the new cache page from the highest zone. Clearly on a 64bit system
allocating below 4g may be ok, but on a 32bit system allocating in the
normal zone below 800m must be absolutely avoided. So it's not a simple
problem. Personally I never liked per-zone lru because of this. But
kswapd isn't the solution and it just wastes memory with no benefit
possible except for the first 5sec when the free memory goes up from
170M to 700M and then it remains stuck at 700M while cp runs for
another 2 hours to read all 500G of hd.
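The min(1%, low) variant suggested at the top of this message caps the gap at whichever of the two is smaller. A minimal userspace sketch follows; min_u64() and capped_gap_threshold() are invented names, and the inputs in the note are the DMA32 figures from the original report.

```c
#include <stdint.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * High wmark plus a gap that is the smaller of 1% of the zone
 * (rounded up) and the zone's low wmark. All values in pages.
 */
static uint64_t capped_gap_threshold(uint64_t present, uint64_t low,
				     uint64_t high)
{
	uint64_t one_percent = (present + 99) / 100;

	return high + min_u64(one_percent, low);
}
```

With THP enabled (low 13972, high 16767) the low wmark is indeed bigger than 1% of the zone (5115 pages), so the threshold is 21882 pages. With the default watermarks (low 1780, high 2136) the low wmark is the smaller term and the threshold is 3916 pages.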
* Re: too big min_free_kbytes 2011-01-28 16:28 ` Andrea Arcangeli @ 2011-01-28 16:46 ` Mel Gorman 2011-01-28 17:16 ` Rik van Riel 2011-01-28 17:34 ` Andrea Arcangeli 2011-01-28 17:10 ` Rik van Riel 1 sibling, 2 replies; 52+ messages in thread
From: Mel Gorman @ 2011-01-28 16:46 UTC
To: Andrea Arcangeli
Cc: Rik van Riel, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote:
> On Fri, Jan 28, 2011 at 10:35:39AM +0000, Mel Gorman wrote:
> > I'd be ok with high+low as a starting point to solve the immediate
> > problem of way too much memory being free and then treat "kswapd must go
> > to sleep" as a separate problem. I'm less keen on 1% but only because it
> > could be too large a value.
>
> min(1%, low) sounds best to me. Because on the 4G system "low" is likely
> bigger than 1%.
>

On a 4G system, sure. On a 16G system, the gap is larger than
min_free_kbytes. Granted, in that case it's less of a problem because we
don't have a small higher zone causing problems.

> But really to me it sounds best to apply my first patch and stick to
> the high watermark and remove the gap.
>
> What is going on is that the dma and pci32 zones are at high+gap. The
> over-4g zone is at "high". kswapd keeps running until all are above
> high. But as long as there's at least one not over high, the others
> are shrunk up to high+gap.
>

Yep, this is why there is an excess of free memory and kswapd is in D
state, as it's stuck in balance_pgdat().

> The allocator is taught that it should always try to allocate from the
> over-4g zone. And the over-4g zone is never below the "low" wmark
> because 100% of the cache is clean so kswapd keeps the normal and dma
> zones at high+gap and the over-4g zone at "high".
>

A consequence of this is that it's much harder for pages in a small high
zone to get old while kswapd stays awake. They get reclaimed far sooner
than pages in the Normal zone, which no doubt leads to some unexpected
slowdowns. It's another reason why we should be making sure kswapd gets
to sleep when there is no pressure.

> In a previous email you asked me how kswapd gets stuck in D state and
> never stops working, and said that it should stop earlier. This sounds
> impossible, kswapd behavior can't possibly change, simply there is
> less memory freed by lowering that "gap".

There might be less memory freed by lowering that gap but it still needs to
exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up
to the high watermark + gap and calling congestion_wait (hence the D state).

> Also you can make the gap as
> big as you want but it'll only make a difference the first time, then
> kswapd will stop shrinking the normal and dma zones when they reach
> high+gap. Regardless of the gap size. So kswapd can't possibly change
> behavior and it can't possibly be in D state by just changing this
> "gap" size. Which is why I think the gap should be zero and I'd like
> my first patch to be applied. There's no point in wasting ram for a
> feature that can't guarantee we rotate the zone allocation.
>

Ok, the gap idea will certainly work in that there will be less memory
freed. It's the first obvious problem and it's the best solution so
far. I will double check myself later if kswapd is stuck in D state due
to looping around balance_pgdat().

> The balancing problem can't be solved in kswapd. It can only be solved
> in the allocator if you really aim to give more rotation to the
> lrus. As long as the over-4g zone is allocated from first, at some
> point the lrus in the normal/dma zone will have to stop
> rotating. Either that or kswapd will shrink 100% of the ram in the
> dma/normal zone, which would destroy all the cache, which is clearly
> wrong.
>
> And if you change the allocator to allocate in rotation from the 3
> zones (clearly we would never want to allocate from the dma zone, so
> it's a magic area here) there is absolutely no need for any "gap" in
> kswapd to keep the shrinking balanced.
>

Rotating through the zones is no problem to implement. The expected
problem is that allocations that could use HighMem or Normal instead use
DMA32, potentially causing a request that requires DMA32 to fail later.

> In short I think tackling the zone balancing problem in kswapd is wrong
> and kswapd should stick to the high wmark only, and if you care about
> zone balancing it should be done in the allocator only, then kswapd
> will cope with whatever the allocator decides just fine.
>

Potentially. We'd need to be careful that allocation requests are not
getting stalled but it's worth investigating.

> I guess the LRU caching behavior of a 4g system with a little memory
> over 4g is going to be worse than if you boot with mem=4g and there's
> nothing kswapd can do about it as long as the allocator always grabs
> the new cache page from the highest zone.

Agreed.

> Clearly on a 64bit system
> allocating below 4g may be ok, but on a 32bit system allocating in the
> normal zone below 800m must be absolutely avoided. So it's not a simple
> problem.

Exactly.

> Personally I never liked per-zone lru because of this. But
> kswapd isn't the solution and it just wastes memory with no benefit
> possible except for the first 5sec when the free memory goes up from
> 170M to 700M and then it remains stuck at 700M while cp runs for
> another 2 hours to read all 500G of hd.
>

:/

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab
* Re: too big min_free_kbytes 2011-01-28 16:46 ` Mel Gorman @ 2011-01-28 17:16 ` Rik van Riel 2011-01-28 17:46 ` Andrea Arcangeli 2011-01-28 17:34 ` Andrea Arcangeli 1 sibling, 1 reply; 52+ messages in thread From: Rik van Riel @ 2011-01-28 17:16 UTC (permalink / raw) To: Mel Gorman Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On 01/28/2011 11:46 AM, Mel Gorman wrote: > On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote: >> In previous email you asked me how kswapd get stuck in D state and >> never stops working, and that it should stop earlier. This sounds >> impossible, kswapd behavior can't possibly change, simply there is >> less memory freed by lowering that "gap". > > There might be less memory freed by lowering that gap but it still needs to > exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up > to the high watermark + gap and calling congestion_wait (hence the D state). The gap works because kswapd has different thresholds for different things: 1) get woken up if every zone on an allocator's zone list is below the low watermark 2) exit the loop if _every_ zone is at or above the high watermark 3) skip a zone in the freeing loop if the zone has more than high + gap free memory Continuing the loop as long as one zone is below the low watermark is what equalizes memory pressure between zones. Skipping the freeing of pages in a zone that already has excessive amounts of free memory helps avoid memory waste and excessive swapping. We simply equalize the balance between zones a little more slowly. What matters is that the memory pressure gets equalized over time. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
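The three thresholds listed in the message above can be written down as a small userspace sketch. This is only a toy model of the description given in the thread, not kernel code: the `Zone` class, the function names, and every number below are hypothetical, and the real logic lives in `balance_pgdat()` and the page allocator.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free: int  # free pages
    low: int   # low watermark, in pages
    high: int  # high watermark, in pages


def should_wake_kswapd(zonelist):
    """(1) Wake kswapd only if every zone on the zonelist is below low."""
    return all(z.free < z.low for z in zonelist)


def kswapd_can_sleep(zones):
    """(2) Leave the loop once every zone is at or above high."""
    return all(z.free >= z.high for z in zones)


def should_shrink(zone, gap):
    """(3) Skip a zone that already has more than high + gap free."""
    return zone.free <= zone.high + gap


# Hypothetical numbers: a large lower zone and a tiny zone above 4G.
below4g = Zone("below4g", free=30000, low=14000, high=17000)
over4g = Zone("over4g", free=300, low=400, high=500)

print(should_wake_kswapd([over4g]))         # over4g sits below its low mark
print(kswapd_can_sleep([below4g, over4g]))  # over4g has not reached high yet
print(should_shrink(below4g, gap=5000))     # below4g is above high + gap
```

Keeping the three checks separate makes it easier to see that only check (3) depends on the gap; checks (1) and (2) use the plain low and high watermarks.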
* Re: too big min_free_kbytes 2011-01-28 17:16 ` Rik van Riel @ 2011-01-28 17:46 ` Andrea Arcangeli 2011-01-28 18:03 ` Rik van Riel 0 siblings, 1 reply; 52+ messages in thread From: Andrea Arcangeli @ 2011-01-28 17:46 UTC (permalink / raw) To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On Fri, Jan 28, 2011 at 12:16:19PM -0500, Rik van Riel wrote: > On 01/28/2011 11:46 AM, Mel Gorman wrote: > > On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote: > > >> In previous email you asked me how kswapd get stuck in D state and > >> never stops working, and that it should stop earlier. This sounds > >> impossible, kswapd behavior can't possibly change, simply there is > >> less memory freed by lowering that "gap". > > > > There might be less memory freed by lowering that gap but it still needs to > > exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up > > to the high watermark + gap and calling congestion_wait (hence the D state). > > The gap works because kswapd has different thresholds for > different things: > > 1) get woken up if every zone on an allocator's zone list > is below the low watermark > > 2) exit the loop if _every_ zone is at or above the > high watermark > > 3) skip a zone in the freeing loop if the zone has more > than high + gap free memory Exactly. > > Continuing the loop as long as one zone is below the low > watermark is what equalizes memory pressure between zones. I think you meant below high wmark here. > Skipping the freeing of pages in a zone that already has > excessive amounts of free memory helps avoid memory waste > and excessive swapping. We simply equalize the balance > between zones a little more slowly. What matters is that > the memory pressure gets equalized over time. The main problem I could see is for the lowmem reserve ratio. 
The only real wmark that will be relevant to the allocator will be the one of the "exact" zone asked to the allocator, not the below zones because of the reserve ratio. So then kswapd will only satisfy the high wmark from the view of the caller for the "exact" zone asked (not the below zones that also must take the lowmem reserve ratio into account). Which is enough but kswapd isn't helping the allocator for the below zones. In any case the gap won't ever be as big as the reserve ratio of the lower zones, so it can't solve this regardless with the gap. Probably what we have right now is already optimal so to put more shrinking pressure on the highest zone asked. Overall I don't see the point of the gap as it's just like setting the below zone wmark higher and I doubt it makes a significant balancing difference. But hey I'm also ok to keep the gap above zero, I just feel it's wasted memory. Surely it should be easy to prove it's wasted memory for the "cp /dev/sda /dev/null" workload on a 4g system with a little ram above 4g. For mixed workloads things are little more interesting but I think on average it's not worth it. My whole point in claiming it can't affect the balancing of the lrus, is that the real lru rotation is entirely controlled by the allocator. It doesn't matter if kswapd stops at high or high+gap, for any zone at any time, as long as the allocator only allocates from one zone or the other. And if the allocator allocates from all zones in a perfectly balanced way, again kswapd will shrink in a perfectly balanced way over time regardless of high or high+gap. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: too big min_free_kbytes 2011-01-28 17:46 ` Andrea Arcangeli @ 2011-01-28 18:03 ` Rik van Riel 2011-01-28 18:24 ` Andrea Arcangeli 0 siblings, 1 reply; 52+ messages in thread From: Rik van Riel @ 2011-01-28 18:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On 01/28/2011 12:46 PM, Andrea Arcangeli wrote: > My whole point in claiming it can't affect the balancing of the lrus, > is that the real lru rotation is entirely controlled by the > allocator. It doesn't matter if kswapd stops at high or high+gap, for > any zone at any time, as long as the allocator only allocates from one > zone or the other. And if the allocator allocates from all zones in a > perfectly balanced way, again kswapd will shrink in a perfectly > balanced way over time regardless of high or high+gap. My point is, the behaviour you describe would be WRONG :) The reason is that the different zones can contain data that is either heavily used or rarely used, often some mixture of the two, but sometimes the zones are out of balance in how much the data in memory gets touched. We need to reclaim and reuse the lightly used memory a little faster than the heavily used memory, to even out the memory pressure between zones. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: too big min_free_kbytes 2011-01-28 18:03 ` Rik van Riel @ 2011-01-28 18:24 ` Andrea Arcangeli 2011-01-28 19:34 ` Rik van Riel 0 siblings, 1 reply; 52+ messages in thread From: Andrea Arcangeli @ 2011-01-28 18:24 UTC (permalink / raw) To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On Fri, Jan 28, 2011 at 01:03:50PM -0500, Rik van Riel wrote: > My point is, the behaviour you describe would be WRONG :) > > The reason is that the different zones can contain data > that is either heavily used or rarely used, often some > mixture of the two, but sometimes the zones are out of > balance in how much the data in memory gets touched. > > We need to reclaim and reuse the lightly used memory > a little faster than the heavily used memory, to even > out the memory pressure between zones. I've no idea how kswapd can reclaim the lightly used memory a little faster when it blocks at high+gap. Unless the allocator is eating into the gap, kswapd will be stuck at 700M free, and no rotation in the lru will ever happen in the lower zones. You can't control it from kswapd but only from the allocator and regardless the size of the gap the rotation won't alter. As eventually in the "cp /dev/sda /dev/null" example workload (but simulating what happens normally during any file read) the "high+gap" will be reached in 5 sec then it'll be like if there's no gap for the next 2 hours. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: too big min_free_kbytes 2011-01-28 18:24 ` Andrea Arcangeli @ 2011-01-28 19:34 ` Rik van Riel 2011-01-28 19:45 ` Andrea Arcangeli 0 siblings, 1 reply; 52+ messages in thread From: Rik van Riel @ 2011-01-28 19:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On 01/28/2011 01:24 PM, Andrea Arcangeli wrote: > On Fri, Jan 28, 2011 at 01:03:50PM -0500, Rik van Riel wrote: >> My point is, the behaviour you describe would be WRONG :) >> >> The reason is that the different zones can contain data >> that is either heavily used or rarely used, often some >> mixture of the two, but sometimes the zones are out of >> balance in how much the data in memory gets touched. >> >> We need to reclaim and reuse the lightly used memory >> a little faster than the heavily used memory, to even >> out the memory pressure between zones. > > I've no idea how kswapd can reclaim the lightly used memory a little > faster when it blocks at high+gap. It will block at high+gap only when one zone has really easily reclaimable memory, and another zone has difficult to free memory. That creates a free memory differential between the easy to free and difficult to free memory zones. If memory in all zones is equally easy to free, kswapd will go to sleep once the high watermark is reached in every zone. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: too big min_free_kbytes 2011-01-28 19:34 ` Rik van Riel @ 2011-01-28 19:45 ` Andrea Arcangeli 2011-01-28 20:55 ` Rik van Riel 0 siblings, 1 reply; 52+ messages in thread From: Andrea Arcangeli @ 2011-01-28 19:45 UTC (permalink / raw) To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On Fri, Jan 28, 2011 at 02:34:31PM -0500, Rik van Riel wrote: > It will block at high+gap only when one zone has really > easily reclaimable memory, and another zone has difficult > to free memory. The other zone doesn't need to be difficult to free up. All ram in immediately freeable clean cache is the most common case there is. And it's more than enough to trigger the scenario in prev email. > That creates a free memory differential between the > easy to free and difficult to free memory zones. There's no difficult to free zone in this scenario. > If memory in all zones is equally easy to free, kswapd > will go to sleep once the high watermark is reached in > every zone. Yes, at that point the high wmark is reached for all zones. Then cp or any file read allocates another high-low amount of clean cache, and kswapd will be waken again. Then when it goes to sleep the over4g tiny zone will be at "high" again but the below zones will be at high+(high_over4gwmark-low_over4gwmark), in about 5 seconds the over4g zone will be at "high" and the other two zones will be at "high+gap". All when there's zero memory pressure in the below zones, and there's just some clean cache shrinking required to allocate the new cache from the over4g zone. Then the below zones lru stops rotating regardless of the size of the gap (0 or 600M makes no difference). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
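The scenario argued in the message above (the allocator only ever draws from the small over4g zone, while kswapd also shrinks the lower zone until it hits a ceiling) can be played through with a toy model. Everything here is an assumption made for illustration: the function name, the page counts, and in particular the idea that kswapd frees a similar number of pages in the lower zone on each wakeup. The `factor` parameter stands in for the multiple of the high watermark at which kswapd starts skipping a zone.

```python
def lower_zone_free(wakeups, factor, high_big=1000, high_small=100, low_small=80):
    """Free pages in the lower zone after each kswapd wakeup (toy model)."""
    ceiling = factor * high_big      # kswapd skips the zone above this level
    free_big = high_big              # lower zone starts at its high watermark
    refill = high_small - low_small  # pages kswapd frees in the over4g zone
    history = []
    for _ in range(wakeups):
        # The allocator has drained over4g from high to low, waking kswapd.
        # Assumption: kswapd reclaims a similar amount from the lower zone
        # on each run, unless that zone is already at the ceiling.
        if free_big < ceiling:
            free_big = min(ceiling, free_big + refill)
        history.append(free_big)
    return history


with_gap = lower_zone_free(wakeups=500, factor=8)
no_gap = lower_zone_free(wakeups=500, factor=1)
print(with_gap[0], with_gap[349], with_gap[-1])
print(no_gap[0], no_gap[-1])
```

In this model the lower zone climbs to its ceiling and then stays there, whatever the ceiling is. A larger `factor` only changes how much memory ends up idle, which is the point being made in the message; whether the real kswapd behaves this way is exactly what the thread is debating.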
* Re: too big min_free_kbytes 2011-01-28 19:45 ` Andrea Arcangeli @ 2011-01-28 20:55 ` Rik van Riel 2011-01-29 19:45 ` Andrea Arcangeli 0 siblings, 1 reply; 52+ messages in thread From: Rik van Riel @ 2011-01-28 20:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On 01/28/2011 02:45 PM, Andrea Arcangeli wrote: > On Fri, Jan 28, 2011 at 02:34:31PM -0500, Rik van Riel wrote: >> It will block at high+gap only when one zone has really >> easily reclaimable memory, and another zone has difficult >> to free memory. > > The other zone doesn't need to be difficult to free up. All ram in > immediately freeable clean cache is the most common case there is. And > it's more than enough to trigger the scenario in prev email. > >> That creates a free memory differential between the >> easy to free and difficult to free memory zones. > > There's no difficult to free zone in this scenario. In that case, every zone will go down to the low watermark before kswapd is woken up. At that point, kswapd will reclaim until every zone is at the high watermark, and go back to sleep. There is no "free up to high + gap" in your scenario. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: too big min_free_kbytes
  2011-01-28 20:55 ` Rik van Riel
@ 2011-01-29 19:45 ` Andrea Arcangeli
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-01-29 19:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Fri, Jan 28, 2011 at 03:55:09PM -0500, Rik van Riel wrote:
> In that case, every zone will go down to the low watermark
> before kswapd is woken up.

This isn't what happens though. If it were, we would see free memory
going back down to ~130M, then up to 700M, then down again to 130M,
and not stuck at 700M at all times like below. Example:

 0  0  70512 134940  379408 2753936   0   0    118   71    5    3   2   1  97   1
 0  0  70512 134808  379408 2753936   0   0      0    0   54   48   0   0 100   0
 0  1  70512 131228  383448 2753928   0   0   4160   68  149  172   0   0  99   1
 0  1  70512 276548  502184 2495564   0   0 118784   36 1357 2084   0   5  73  21
 1  1  70512 507932  624128 2151616   0   0 121984    0 1521 2166   0   6  77  17
 0  1  70512 699264  746484 1860468   0   0 122368    4 1443 2242   0   5  74  20
 0  1  70512 727040  865936 1722716   0   0 119552    0 1344 2194   0   5  75  21
 0  1  70512 733116  984396 1610292   0   0 118528    0 1311 2139   0   4  76  20
 1  0  70512 724064 1102864 1510256   0   0 118528    0 1302 2132   0   4  75  21
 1  0  70512 728900 1224312 1394328   0   0 121472    0 1395 2168   0   4  77  19
 1  0  70512 733736 1337224 1286852   0   0 115840   40 1404 2074   0   4  74  22

> At that point, kswapd will reclaim until every zone is at
> the high watermark, and go back to sleep.
>
> There is no "free up to high + gap" in your scenario.

Well there clearly is from vmstat... I think you should be able to
reproduce if you boot with something like mem=4200m or so, workload is
simple "cp /dev/sda /dev/null". Maybe we're waking kswapd too soon. But
kswapd definitely goes to sleep, in fact it sleeps most of the time and
runs every once in a while, and it's unclear why the free memory never
gets back to the 130M level where it usually sits when there's no
intensive read I/O like shown above.

For now, given what I see, I have to assume kswapd is woken too soon,
and not only when all wmarks reach low, or the free memory wouldn't be
stuck at ~700M at all times while cp runs. If kswapd is woken up too
soon, to me that is a separate problem and I still don't see a
significant benefit of having any "gap" bigger than "high-low" there...
Like you said kswapd shouldn't run until we hit the low wmark again on
all zones, and I think that's more than enough without more "gap" than
the already available default "high-low" gap for the lower zones.

If the zone is bigger (like the below4g zone above) the wmark will be
bigger relative to the other zones. So when kswapd is woken up because
all zones reach low wmark (we agree this is what should happen even if
it doesn't look like it's working right with "cp"), assuming all cache
is clean and immediately freeable kswapd will have to invoke
shrink_cache more times for the below4g zone. This "gap" added to
"high-low" will make the above4g lru rotate more times than needed to
reach the high wmark. But we allocated only "high-low" amount of cache
in the above4g zone lru. So I'm not sure if shrinking more than
"high-low" from it is right even from a balancing perspective in the
absolute trivial case of just 1 wakeup every time all zones hit the low
wmark.

At the same time if kswapd frees memory at the same rate that an over4g
allocator is allocating it, kswapd won't go to sleep and there will be
no rotation in the below4g lru at all. This is similar to what we see
above in fact, except for me kswapd goes to sleep because cp isn't fast
enough but a page fault could trigger it and prevent the lru of the
lower zones from ever rotating (simulating a kswapd wakeup too soon, by
just not making kswapd go to sleep and keeping hitting on the high-low
range on the over4g zone).

So you see, there is no real reliable way to have balancing guarantees
from kswapd, and for the trivial case where there is no concurrency
between allocator and kswapd freeing, rotating the tiny above4g lru more
than "high-low" even though we only allocated "high-low" cache into it
doesn't sound obviously right either. A bigger gap looks to me like it
will do more harm than good, and if we need a real guarantee of
balancing we should rotate the allocations across the zones (bigger lru
in a zone will require it to be hit more frequently because it'll rotate
slower than the other zones, the bias should not even depend on the zone
size but on the lru size).

So for now it's all statistical but I doubt the "gap" shrunk in addition
to the "high-low" cache max allocated is providing benefit. Even in the
non racing case all I can see is the smaller zones (satisfying the
"high" wmark faster than the bigger zones) (and the smaller zones
statistically should get a smaller lru too) being lru-rotated way more
than their small "high-low". Smaller zones should be rotated in
proportion to their small "high-low" only, and not potentially as much
as the biggest "high-low" for the biggest zone.
* Re: too big min_free_kbytes 2011-01-28 16:46 ` Mel Gorman 2011-01-28 17:16 ` Rik van Riel @ 2011-01-28 17:34 ` Andrea Arcangeli 1 sibling, 0 replies; 52+ messages in thread From: Andrea Arcangeli @ 2011-01-28 17:34 UTC (permalink / raw) To: Mel Gorman; +Cc: Rik van Riel, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On Fri, Jan 28, 2011 at 04:46:24PM +0000, Mel Gorman wrote: > On Fri, Jan 28, 2011 at 05:28:31PM +0100, Andrea Arcangeli wrote: > > On Fri, Jan 28, 2011 at 10:35:39AM +0000, Mel Gorman wrote: > > > I'd be ok with high+low as a starting point to solve the immediate > > > problem of way too much memory being free and then treat "kswapd must go > > > to sleep" as a separate problem. I'm less keen on 1% but only because it > > > could be too large a value. > > > > min(1%, low) sounds best to me. Because on the 4G system "low" is likely > > bigger than 1%. > > > > On a 4G system, sure. On a 16G system, the gap is larger than > min_free_kbytes. Granted, in that case it's less of a problem because we > don't have a small higher zone causing problems. Agreed, there I also prefer the low wmark ;). > Yep, this is why there is an excess of free memory and kswapd stuck in D state > as it's stuck in balance_pgdat(). kswapd in the "cp /dev/sda /dev/null" workload can't possibly be stuck in D state at any given tiem. There's no I/O it has to do, it's 100% clean cache. It's always in S or R state. But every time it gets waken up when the over4g zone hits the low wmark, it shrinks the over4g until it's over "high" and also until all below zones are "high+gap". So in 5 sec what happens is the other zones are stuck at "high+gap" and it stops shrinking them forever, and it only keeps the over-4g zone from "low" to "high", because the allocator picks always from the over4g zone. > A consequence of this is that it's much harder for pages in a small high zone > to get old while kswapd stays awake. 
They get reclaimed far sooner than pages > in the Normal soon which no doubt leads to some unexpected slowdowns. It's > another reason why we should be making sure kswapd gets to sleep when > there is no pressure. The problem it's not kswapd, it's the allocator. There's nothing kswapd can do about it. kswapd has no fatigue in shrinking any zone, it's all 100% clean immediately reclaimable cache, we could shrink it even from GFP_ATOMIC context from irq (just not nmi) if we wanted. > There might be less memory freed by lowering that gap but it still needs to > exit balance_pgdat() and go to sleep. Otherwise it'll keep freeing zones up > to the high watermark + gap and calling congestion_wait (hence the D state). I just can't see how the size of the "gap" can make any difference, 0 gap or 1g gap, the only thing that will change is the amount of memory free you see, the kswapd state not. > Ok, the gap idea will certainly work in that there will be less memory > freed. It's the first obvious problem and it's the best solution so far. > I will double check myself later if kswapd is stuck in D state due to looping > around balance_pgdat(). I'll check that too, but I don't see how the gap can affect that. Setting the gap to 600M with high set to 100M, is like setting high to 700M manually for that zone and eliminate the gap. Only thing that changes is the behavior of min_free_kbytes. > Rotating through the zones is no problem to implement. The expected problem > is that allocations that could use HighMem or Normal instead use DMA32 > potentially causing a request that requires DMA32 to fail later. Exactly. Note the lowmem reserve ratio algorithm exists exactly to reserve a portion of memory to the users of the lowmem zones. Otherwise things go bad when all memory is free. So thanks to the lowmem reserve ratio algorithm, it's less of an issue to rotate across the zones. But it's a separate issue. 
> > I guess the LRU caching behavior of a 4g system with a little memory > > over 4g is going to be worse than if you boot with mem=4g and there's > > nothing kswapd can do about it as long as the allocator always grabs > > the new cache page from the highest zone. > > Agreed. > > > Clearly on a 64bit system > > allocating below 4g may be ok, but on 32bit system allocating in the > > normal zone below 800m must be absolutely avoided. So it's not simple > > problem. > > Exactly. Full agreement here. As said above it is very possible the lowmem reserve ratio is enough and we can now rotate freely across the zones. The lowmem reserve ratio is already tuned in a way that on a 32G x86_32 all the normal zone will be forbidden. It scales down as the ratio between the highemm vs normal zone goes down. On a 1g system most of the normal zone becomes available also for highmem allocations. It's made exactly for that. If we want to tackle this later we can and we can try to depend entirely on the lowmem reserve ratio to do the right thing at allocation time by making all wmark variable depending on who's allocating what, but kswapd should just stick to "high" IMHO and gap 0. However if I'm proven wrong then I'm also ok with min(1%, low), no problem with me. Once we fix this (either with gap 0 or gap min(1%,low)), running -set-recommended-min_free_kbytes should lead to less memory wasted (in the 4g setup with a little memory over 4g) then before running -set-recommended-min_free_kbytes at boot. > > Personally I never liked per-zone lru because of this. But > > kswapd isn't the solution and it just wastes memory with no benefit > > possible except for the first 5sec when the free memory goes up from > > 170M to 700M and then it remains stuck at 700M while cp runs for > > another 2 hours to read all 500G of hd. > > > > :/ ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
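To put numbers on the min(1%, low) gap discussed above, here is a short sketch using the DMA32 figures posted in the first message of the thread (THP enabled). It reads "1%" as 1% of the zone's present pages and assumes 4 KiB pages; both are assumptions, and the helper names are hypothetical rather than anything taken from the kernel.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages


def balance_gap(present_pages, low_wmark_pages):
    """Gap candidate from the thread: min(1% of the zone, low watermark)."""
    return min(present_pages // 100, low_wmark_pages)


def to_mib(pages):
    """Convert a page count to MiB."""
    return pages * PAGE_KIB / 1024


# DMA32 figures from the first message of the thread (THP enabled).
present, low, high = 511496, 13972, 16767

gap = balance_gap(present, low)
print(gap, round(to_mib(high + gap), 1))     # stop level with high + gap
print(8 * high, round(to_mib(8 * high), 1))  # stop level with 8 * high
```

For this zone the low watermark (13972 pages) is larger than 1% of the zone (5114 pages), so the 1% term is the one that applies, and the stop level lands well below the one implied by 8 times the high watermark.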
* Re: too big min_free_kbytes 2011-01-28 16:28 ` Andrea Arcangeli 2011-01-28 16:46 ` Mel Gorman @ 2011-01-28 17:10 ` Rik van Riel 1 sibling, 0 replies; 52+ messages in thread From: Rik van Riel @ 2011-01-28 17:10 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On 01/28/2011 11:28 AM, Andrea Arcangeli wrote: > In short I think the zone balancing problem tackled in kswapd is wrong > and kswapd should stick to the high wmark only, and if you care about > zone balancing it should be done in the allocator only, then kswapd > will cope with whatever the allocator decides just fine. The allocator does not have information on which memory zones have more heavily used data vs which zones have less frequently used data. When the system starts up, we do our initial allocations in the top zone. This includes both heavily used files (like libc) and never-used-again files, as well as daemons that are active and daemons that go to sleep and never do anything again. After initial startup, we may eventually end up falling back to lower memory zones. In short, we may have an imbalance between the zones in how actively memory is used, from the moment the system has started up. The distance between the low and high watermarks corresponds only to the relative size of each zone. Having kswapd move only between these two watermarks means that memory in each zone is allocated and freed only according to zone size, not according to how actively used the memory in each zone is. Giving kswapd a little bit of extra room where it is allowed to extra free pages in a zone with lots of infrequently used and easily reclaimable pages, when another zone in the same node suffers from harder to deal with memory pressure, will steer more allocations towards the memory zone that has less pressure. This should even out the pressure between zones over time. 
We have had the kernel work like this since 2.6.0, and I believe that removing this "pressure valve" from the VM will result in the kind of balancing problems we had in some 2.4 kernels. Reducing the size of the gap is fine with me, since the pressure should even out over time. Removing the gap is just asking for trouble. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: too big min_free_kbytes 2011-01-27 15:27 ` Andrea Arcangeli 2011-01-27 16:03 ` Mel Gorman @ 2011-02-03 2:58 ` Andrea Arcangeli 2011-02-03 13:15 ` Mel Gorman ` (2 more replies) 2011-02-12 9:48 ` alex shi 2 siblings, 3 replies; 52+ messages in thread From: Andrea Arcangeli @ 2011-02-03 2:58 UTC (permalink / raw) To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote: > totally untested... I will test.... The below patch is fixing my problem and working fine for me... as expected it can't possibly lead to any D state, it's pretty much like setting min_free_kbytes lower, and it's not going to alter anything other than the levels of free memory kept by kswapd. $ while :; do ps xa|grep [k]swapd; sleep 1; done 452 ? R 1:20 [kswapd0] 452 ? S 1:20 [kswapd0] 452 ? S 1:20 [kswapd0] 452 ? S 1:20 [kswapd0] 452 ? S 1:20 [kswapd0] 452 ? R 1:20 [kswapd0] 452 ? R 1:20 [kswapd0] 452 ? R 1:20 [kswapd0] 452 ? R 1:20 [kswapd0] 452 ? S 1:20 [kswapd0] 452 ? R 1:20 [kswapd0] $ vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 1 1784 111040 2393336 807924 0 0 63 992 56 70 1 1 96 2 0 1 1784 108928 2402556 801864 0 0 122624 0 1619 2150 0 5 80 16 0 1 1784 110664 2401244 801140 0 0 122496 0 1602 2081 0 3 81 16 0 1 1784 109796 2410184 792984 0 0 122752 0 1685 2149 0 4 80 16 0 1 1784 110416 2411856 791208 0 0 120448 4 1599 2075 0 4 81 16 1 0 1784 113516 2415344 785336 0 0 122496 0 1636 2125 0 4 81 15 I doubt we'll get any regression because of the below (see also my prev email in this thread), and I would only expect more cache and maybe better lru. 
Previously the free memory levels were stuck at ~700M now they're stuck at the right level for a 4G system with THP on (I'd still like to try to reduce the requirements only 1 hugepage for each migratetype in the set_min_free_kbytes to reduce the requirements to the minium, but only if possible..). But this saves 600M over 4G so it's the highest prio to address. Comments welcome, Thanks! Andrea > ==== > Subject: vmscan: kswapd must not free more than high_wmark pages > > From: Andrea Arcangeli <aarcange@redhat.com> > > When the min_free_kbytes is set with `hugeadm > --set-recommended-min_free_kbytes" or with THP enabled (which runs the > equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate > anti-frag at full effectiveness automatically at boot) the high wmark > of some zone is as high as ~88M. 88M free on a 4G system isn't > horrible, but 88M*8 = 704M free on a 4G system is definitely > unbearable. This only tends to be visible on 4G systems with tiny > over-4g zone where kswapd insists to reach the high wmark on the > over-4g zone but doing so it shrunk up to 704M from the normal zone by > mistake. > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > --- > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index f5d90de..9e3c78e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2407,7 +2407,7 @@ loop_again: > * zone has way too many pages free already. > */ > if (!zone_watermark_ok_safe(zone, order, > - 8*high_wmark_pages(zone), end_zone, 0)) > + high_wmark_pages(zone), end_zone, 0)) > shrink_zone(priority, zone, &sc); > reclaim_state->reclaimed_slab = 0; > nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
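[Editorial note] The 88M*8 = 704M arithmetic from the changelog above can be checked with a small userspace sketch. This is not kernel code: the helper names are hypothetical, a 4 KiB page size is assumed, and the multiplier models only the factor that the patch changes in the zone_watermark_ok_safe() call.

```c
#include <assert.h>

/* Hypothetical helper: megabytes to pages, assuming 4 KiB pages. */
static unsigned long mb_to_pages(unsigned long mb)
{
	return mb * 1024UL * 1024UL / 4096UL;
}

/* Hypothetical helper: pages back to megabytes, assuming 4 KiB pages. */
static unsigned long pages_to_mb(unsigned long pages)
{
	return pages * 4096UL / (1024UL * 1024UL);
}

/*
 * Free-page level a zone must exceed before kswapd skips shrinking it.
 * multiplier = 8 models the check before the patch, 1 the check after it.
 */
static unsigned long kswapd_skip_threshold(unsigned long high_wmark_pages,
					   unsigned int multiplier)
{
	assert(multiplier > 0);
	return (unsigned long)multiplier * high_wmark_pages;
}
```

With a high watermark of about 88M, the old factor of 8 gives a threshold of 704M, while the patched check stops at 88M.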
* Re: too big min_free_kbytes 2011-02-03 2:58 ` Andrea Arcangeli @ 2011-02-03 13:15 ` Mel Gorman 2011-02-03 18:59 ` Andrea Arcangeli 2011-02-03 14:36 ` Rik van Riel 2011-02-14 2:25 ` Shaohua Li 2 siblings, 1 reply; 52+ messages in thread From: Mel Gorman @ 2011-02-03 13:15 UTC (permalink / raw) To: Andrea Arcangeli Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel

On Thu, Feb 03, 2011 at 03:58:08AM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> > totally untested... I will test....
>
> The below patch is fixing my problem and working fine for me... as
> expected it can't possibly lead to any D state, it's pretty much like
> setting min_free_kbytes lower, and it's not going to alter anything
> other than the levels of free memory kept by kswapd.
>
> $ while :; do ps xa|grep [k]swapd; sleep 1; done
>   452 ?        R      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]
>   452 ?        S      1:20 [kswapd0]
>   452 ?        R      1:20 [kswapd0]

I got a chance to test this today and I see similar results. I still do see
kswapd entering D state occasionally and I'm convinced it's because it's
calling congestion_wait() i.e. it's not real IO but it's being accounted
for as an IO-related wait. That said, it's mostly asleep (S) or running (R)
and free memory is at reasonable levels so it's a big improvement.

> $ vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free     buff   cache   si   so      bi    bo    in    cs us sy id wa
>  2  1   1784  111040  2393336  807924    0    0      63   992    56    70  1  1 96  2
>  0  1   1784  108928  2402556  801864    0    0  122624     0  1619  2150  0  5 80 16
>  0  1   1784  110664  2401244  801140    0    0  122496     0  1602  2081  0  3 81 16
>  0  1   1784  109796  2410184  792984    0    0  122752     0  1685  2149  0  4 80 16
>  0  1   1784  110416  2411856  791208    0    0  120448     4  1599  2075  0  4 81 16
>  1  0   1784  113516  2415344  785336    0    0  122496     0  1636  2125  0  4 81 15
>
> I doubt we'll get any regression because of the below (see also my
> prev email in this thread), and I would only expect more cache and
> maybe better lru. Previously the free memory levels were stuck at
> ~700M now they're stuck at the right level for a 4G system with THP on
> (I'd still like to try to reduce the requirements only 1 hugepage for
> each migratetype in the set_min_free_kbytes to reduce the requirements
> to the minium, but only if possible..). But this saves 600M over 4G so
> it's the highest prio to address.
>
> Comments welcome,

I think this is the best direction to take for the moment to close the obvious
bug. More thought is required on when exactly kswapd is going to sleep and
on what zones the allocator should be using but there is no quick answer that
will simply have other consequences. As much as I'd like to investigate this
further now, I'm in the process of changing jobs and expect to be heavily
disrupted for at least a month during the changeover. So, for this;

Reviewed-and-tested-by: Mel Gorman <mel@csn.ul.ie>

> Thanks!
> Andrea
>
> > ====
> > Subject: vmscan: kswapd must not free more than high_wmark pages
> >
> > From: Andrea Arcangeli <aarcange@redhat.com>
> >
> > When the min_free_kbytes is set with `hugeadm
> > --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> > equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> > anti-frag at full effectiveness automatically at boot) the high wmark
> > of some zone is as high as ~88M. 88M free on a 4G system isn't
> > horrible, but 88M*8 = 704M free on a 4G system is definitely
> > unbearable. This only tends to be visible on 4G systems with tiny
> > over-4g zone where kswapd insists to reach the high wmark on the
> > over-4g zone but doing so it shrunk up to 704M from the normal zone by
> > mistake.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f5d90de..9e3c78e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2407,7 +2407,7 @@ loop_again:
> >  			 * zone has way too many pages free already.
> >  			 */
> >  			if (!zone_watermark_ok_safe(zone, order,
> > -					8*high_wmark_pages(zone), end_zone, 0))
> > +					high_wmark_pages(zone), end_zone, 0))
> >  				shrink_zone(priority, zone, &sc);
> >  			reclaim_state->reclaimed_slab = 0;
> >  			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,

-- 
Mel Gorman
* Re: too big min_free_kbytes 2011-02-03 13:15 ` Mel Gorman @ 2011-02-03 18:59 ` Andrea Arcangeli 0 siblings, 0 replies; 52+ messages in thread From: Andrea Arcangeli @ 2011-02-03 18:59 UTC (permalink / raw) To: Mel Gorman; +Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel

On Thu, Feb 03, 2011 at 01:15:49PM +0000, Mel Gorman wrote:
> I got a chance to test this today and I see similar results. I still do see
> kswapd entering D state occasionally and I'm convinced it's because it's
> calling congestion_wait() i.e. it's not real IO but it's being accounted
> for as an IO-related wait. That said, it's mostly asleep (S) or running (R)
> and free memory is at reasonable levels so it's a big improvement.

I have never seen it in D state here, but maybe it happens occasionally.
I would expect the R/S/D states not to be altered by this change; only
the free levels should be altered.

> I think this is the best direction to take for the moment to close the obvious
> bug. More thought is required on when exactly kswapd is going to sleep and
> on what zones the allocator should be using but there is no quick answer that
> will simply have other consequences. As much as I'd like to investigate this
> further now, I'm in the process of changing jobs and expect to be heavily
> disrupted for at least a month during the changeover. So, for this;

I fully agree we should check (with less hurry) exactly when kswapd is
going to sleep under this load, in case it's woken too early. I expect
that will remain an independent issue, and I don't expect this patch to
need reverting once we figure out why the free levels always stay at
"high" and we don't see them reaching "low".

Thanks for the review,
Andrea
* Re: too big min_free_kbytes 2011-02-03 2:58 ` Andrea Arcangeli 2011-02-03 13:15 ` Mel Gorman @ 2011-02-03 14:36 ` Rik van Riel 2011-02-03 19:11 ` Andrea Arcangeli 2011-02-14 2:25 ` Shaohua Li 2 siblings, 1 reply; 52+ messages in thread From: Rik van Riel @ 2011-02-03 14:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On 02/02/2011 09:58 PM, Andrea Arcangeli wrote:

> Comments welcome,
> Thanks!
> Andrea
>
>> ====
>> Subject: vmscan: kswapd must not free more than high_wmark pages

NAK

I believe we need a little bit of slack above high_wmark_pages,
to be able to even out memory pressure between zones.

Maybe free up to high_wmark_pages + min_wmark_pages ?

-- 
All rights reversed
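[Editorial note] For comparison, the three stop thresholds discussed so far (8*high, plain high, and the high + min suggested here) can be put side by side. This is a minimal userspace sketch, assuming the DMA32 watermarks from the original report (min 11178, low 13972, high 16767); the struct, enum, and function names are hypothetical and are not kernel identifiers.

```c
#include <assert.h>

/* Per-zone watermarks, in pages. */
struct zone_wmarks {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/* DMA32 values with THP enabled, taken from the first mail in the thread. */
static const struct zone_wmarks dma32_after_thp = {
	.min = 11178, .low = 13972, .high = 16767,
};

/* Hypothetical labels for the three candidate checks. */
enum kswapd_policy {
	POLICY_8X_HIGH,		/* behaviour before the patch */
	POLICY_HIGH_ONLY,	/* the posted patch */
	POLICY_HIGH_PLUS_MIN,	/* the slack suggested in this reply */
};

/* Free-page level above which kswapd would stop shrinking the zone. */
static unsigned long skip_threshold(const struct zone_wmarks *z,
				    enum kswapd_policy policy)
{
	switch (policy) {
	case POLICY_8X_HIGH:
		return 8 * z->high;
	case POLICY_HIGH_ONLY:
		return z->high;
	case POLICY_HIGH_PLUS_MIN:
		return z->high + z->min;
	}
	assert(0 && "unknown policy");
	return 0;
}
```

For that zone the thresholds come out at 134136, 16767, and 27945 pages respectively, so high + min keeps some slack while staying far below the old factor of 8.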
* Re: too big min_free_kbytes 2011-02-03 14:36 ` Rik van Riel @ 2011-02-03 19:11 ` Andrea Arcangeli 2011-02-12 1:28 ` Simon Kirby 0 siblings, 1 reply; 52+ messages in thread From: Andrea Arcangeli @ 2011-02-03 19:11 UTC (permalink / raw) To: Rik van Riel; +Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C

On Thu, Feb 03, 2011 at 09:36:47AM -0500, Rik van Riel wrote:
> On 02/02/2011 09:58 PM, Andrea Arcangeli wrote:
>
> > Comments welcome,
> > Thanks!
> > Andrea
> >
> >> ====
> >> Subject: vmscan: kswapd must not free more than high_wmark pages
>
> NAK
>
> I believe we need a little bit of slack above high_wmark_pages,
> to be able to even out memory pressure between zones.
>
> Maybe free up to high_wmark_pages + min_wmark_pages ?

If this can only go in with high+min, that's still better than *8. But
in my previous email on this thread I explained why I think it's not
beneficial for lru balancing, and this level can't affect kswapd wakeup
times either, so I personally prefer just "high". I don't think out of
memory has anything to do with this: the "min" level is all about the
PF_MEMALLOC and OOM levels. Zone balancing has nothing to do with this
either; the only "hard" thing that guarantees balancing is the lowmem
reserve ratio (high ptes allocated in lowmem zones aren't relocatable,
etc.).
* Re: too big min_free_kbytes 2011-02-03 19:11 ` Andrea Arcangeli @ 2011-02-12 1:28 ` Simon Kirby 0 siblings, 0 replies; 52+ messages in thread From: Simon Kirby @ 2011-02-12 1:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C On Thu, Feb 03, 2011 at 08:11:57PM +0100, Andrea Arcangeli wrote: > On Thu, Feb 03, 2011 at 09:36:47AM -0500, Rik van Riel wrote: > > On 02/02/2011 09:58 PM, Andrea Arcangeli wrote: > > > > >> Subject: vmscan: kswapd must not free more than high_wmark pages > > > > NAK > > > > I believe we need a little bit of slack above high_wmark_pages, > > to be able to even out memory pressure between zones. > > > > Maybe free up to high_wmark_pages + min_wmark_pages ? > > If this can only go in with high+min that's still better than *8, but > in prev email on this thread I explained why I think it's not > beneficial for lru balancing and this level can't affect kswapd wakeup > times either, so I personally prefer just "high". I don't think out of > memory has anything to do with this the "min" level is all about the > PF_MEMALLOC and OOM levels. The zone balancing as well has nothing to > do with this and the only "hard" thing that guarantees balancing is > the lowmem reserve ratio (high ptes allocated in lowmem zones aren't > relocatable etc..). I was proposing before that the allocator fast path should use a weighted (by zone size) round robin approach to the available zones, rather than allocating from top down, so that reclaim would be fair rather than small zones reclaiming stuff earlier than larger zones. Riel pointed out that this 8*high_wmark_pages thing helped free a proportional amount of stuff from the zone once the high_wmark was breached, eventually causing allocation rates for each zone to end up being close to the actual size of the zone. This happens because the watermark values are set based on the size of the zone. 
I still think this approach is a bit odd, since when kswapd first wakes up, systems with multiple zones will reclaim things that aren't as old as the stuff in the highest zone, until the system runs for a while and this watermark thing balances the allocation rates.

OTOH, changing the allocator increases the possibility of some high-order DMA zone allocation failing during boot that otherwise wouldn't.

Simon-
* Re: too big min_free_kbytes 2011-02-03 2:58 ` Andrea Arcangeli 2011-02-03 13:15 ` Mel Gorman 2011-02-03 14:36 ` Rik van Riel @ 2011-02-14 2:25 ` Shaohua Li 2011-02-22 14:25 ` Mel Gorman 2 siblings, 1 reply; 52+ messages in thread From: Shaohua Li @ 2011-02-14 2:25 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi On Thu, Feb 03, 2011 at 10:58:08AM +0800, Andrea Arcangeli wrote: > On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote: > > totally untested... I will test.... > > The below patch is fixing my problem and working fine for me... as > expected it can't possibly lead to any D state, it's pretty much like > setting min_free_kbytes lower, and it's not going to alter anything > other than the levels of free memory kept by kswapd. > > $ while :; do ps xa|grep [k]swapd; sleep 1; done > 452 ? R 1:20 [kswapd0] > 452 ? S 1:20 [kswapd0] > 452 ? S 1:20 [kswapd0] > 452 ? S 1:20 [kswapd0] > 452 ? S 1:20 [kswapd0] > 452 ? R 1:20 [kswapd0] > 452 ? R 1:20 [kswapd0] > 452 ? R 1:20 [kswapd0] > 452 ? R 1:20 [kswapd0] > 452 ? S 1:20 [kswapd0] > 452 ? R 1:20 [kswapd0] > $ vmstat 1 > procs -----------memory---------- ---swap-- -----io---- -system-- > ----cpu---- > r b swpd free buff cache si so bi bo in cs us > sy id wa > 2 1 1784 111040 2393336 807924 0 0 63 992 56 70 1 1 96 2 > 0 1 1784 108928 2402556 801864 0 0 122624 0 1619 2150 0 5 80 16 > 0 1 1784 110664 2401244 801140 0 0 122496 0 1602 2081 0 3 81 16 > 0 1 1784 109796 2410184 792984 0 0 122752 0 1685 2149 0 4 80 16 > 0 1 1784 110416 2411856 791208 0 0 120448 4 1599 2075 0 4 81 16 > 1 0 1784 113516 2415344 785336 0 0 122496 0 1636 2125 0 4 81 15 > > I doubt we'll get any regression because of the below (see also my > prev email in this thread), and I would only expect more cache and > maybe better lru. 
Previously the free memory levels were stuck at
> ~700M now they're stuck at the right level for a 4G system with THP on
> (I'd still like to try to reduce the requirements only 1 hugepage for
> each migratetype in the set_min_free_kbytes to reduce the requirements
> to the minium, but only if possible..). But this saves 600M over 4G so
> it's the highest prio to address.

Sorry for the late response, I was offline for several weeks.
The patch addresses the 8*high_wmark issue, which isn't the original issue
I reported (sure, the 8*wmark issue should be fixed too).
min_free_kbytes is set higher and causes more pages to be freed even
without the 8*wmark issue. wmark:
before: min 1424
after: min 11178
In our test, about 50M of memory stays free (originally just about 5M),
which will cause more swap. Should we also reduce min_free_kbytes?

Thanks,
Shaohua
* Re: too big min_free_kbytes 2011-02-14 2:25 ` Shaohua Li @ 2011-02-22 14:25 ` Mel Gorman 2011-02-22 14:42 ` Andrea Arcangeli 2011-02-23 5:29 ` Shaohua Li 0 siblings, 2 replies; 52+ messages in thread From: Mel Gorman @ 2011-02-22 14:25 UTC (permalink / raw) To: Shaohua Li Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi On Mon, Feb 14, 2011 at 10:25:24AM +0800, Shaohua Li wrote: > On Thu, Feb 03, 2011 at 10:58:08AM +0800, Andrea Arcangeli wrote: > > On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote: > > > totally untested... I will test.... > > > > The below patch is fixing my problem and working fine for me... as > > expected it can't possibly lead to any D state, it's pretty much like > > setting min_free_kbytes lower, and it's not going to alter anything > > other than the levels of free memory kept by kswapd. > > > > $ while :; do ps xa|grep [k]swapd; sleep 1; done > > 452 ? R 1:20 [kswapd0] > > 452 ? S 1:20 [kswapd0] > > 452 ? S 1:20 [kswapd0] > > 452 ? S 1:20 [kswapd0] > > 452 ? S 1:20 [kswapd0] > > 452 ? R 1:20 [kswapd0] > > 452 ? R 1:20 [kswapd0] > > 452 ? R 1:20 [kswapd0] > > 452 ? R 1:20 [kswapd0] > > 452 ? S 1:20 [kswapd0] > > 452 ? R 1:20 [kswapd0] > > $ vmstat 1 > > procs -----------memory---------- ---swap-- -----io---- -system-- > > ----cpu---- > > r b swpd free buff cache si so bi bo in cs us > > sy id wa > > 2 1 1784 111040 2393336 807924 0 0 63 992 56 70 1 1 96 2 > > 0 1 1784 108928 2402556 801864 0 0 122624 0 1619 2150 0 5 80 16 > > 0 1 1784 110664 2401244 801140 0 0 122496 0 1602 2081 0 3 81 16 > > 0 1 1784 109796 2410184 792984 0 0 122752 0 1685 2149 0 4 80 16 > > 0 1 1784 110416 2411856 791208 0 0 120448 4 1599 2075 0 4 81 16 > > 1 0 1784 113516 2415344 785336 0 0 122496 0 1636 2125 0 4 81 15 > > > > I doubt we'll get any regression because of the below (see also my > > prev email in this thread), and I would only expect more cache and > > maybe better lru. 
Previously the free memory levels were stuck at
> > ~700M now they're stuck at the right level for a 4G system with THP on
> > (I'd still like to try to reduce the requirements only 1 hugepage for
> > each migratetype in the set_min_free_kbytes to reduce the requirements
> > to the minium, but only if possible..). But this saves 600M over 4G so
> > it's the highest prio to address.
> Sorry for the later response, I offlined several weeks.
> The patch is addressing the 8*high_wmark issue, which isn't the original issue
> I reported (sure the 8*wmark issue should be fixed too).
> min_free_kbytes is set higher and cause more pages freed even no the 8*wmark
> issue. wmark:
> before: min 1424
> after: min 11178

The higher min_free_kbytes is expected as a result of using transparent
hugepages so I don't really consider it a bug. Free memory going up to
about 700M as a result of kswapd is a real bug though.

> in our test, there is about 50M memory free (originally just about 5M, which
> will cause more swap. Should we also reduce the min_free_kbytes?
>

Either that or boot with transparent hugepages disabled and
min_free_kbytes will be lower.

-- 
Mel Gorman
SUSE Labs
* Re: too big min_free_kbytes 2011-02-22 14:25 ` Mel Gorman @ 2011-02-22 14:42 ` Andrea Arcangeli 2011-02-22 14:50 ` Mel Gorman 2011-02-22 16:04 ` Mel Gorman 2011-02-23 5:29 ` Shaohua Li 1 sibling, 2 replies; 52+ messages in thread From: Andrea Arcangeli @ 2011-02-22 14:42 UTC (permalink / raw) To: Mel Gorman Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 02:25:59PM +0000, Mel Gorman wrote:
> The higher min_free_kbytes is expected as a result of using transparent
> hugepages so I don't really consider it a bug. Free memory going up to

That's true. THP can definitely increase the memory footprint of
certain apps. Especially if the app is allocating lots of data but only
touching a few bytes scattered over the mapping, the memory footprint
can increase up to 512-fold (absolute worst case of course; on average
it will be less). This is why there's the enabled=madvise option after
all.

> about 700M as a result of kswapd is a real bug though.

Yes.

> > in our test, there is about 50M memory free (originally just about 5M, which
> > will cause more swap. Should we also reduce the min_free_kbytes?
> >
>
> Either that or boot with transparent hugepages disabled and
> min_free_kbytes will be lower.

I suggest booting with transparent_hugepage=madvise, or setting the
default to madvise in make menuconfig. That will still enable the
anti-frag logic in the buddy allocator in full. If the problem goes
away with the madvise setting, then it's not related to
min_free_kbytes. With the 700M fix for kswapd, however, it's hard to
imagine the increased min_free_kbytes causing out of memory conditions,
even if it uses a little more memory to allow for increased performance
thanks to hugepages.

Another thing we can change (in addition to the 700M-waste fix in
kswapd) is this:

	/*
	 * By default disable transparent hugepages on smaller systems,
	 * where the extra memory used could hurt more than TLB overhead
	 * is likely to save.  The admin can still enable it through /sys.
	 */
	if (totalram_pages < (512 << (20 - PAGE_SHIFT)))
		transparent_hugepage_flags = 0;

and:

	/* don't ever allow to reserve more than 5% of the lowmem */
	recommended_min = min(recommended_min,
			      (unsigned long) nr_free_buffer_pages() / 20);

We can reduce the max min_free_kbytes to less than 5% of the lowmem,
and we can also decide not to enable THP if there's less than 2G
instead of "less than 512M".

I'm also intrigued by reducing this from 2 to 1:

	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
	recommended_min = pageblock_nr_pages * nr_zones * 2;

Do we really need 2 pages instead of just 1 here to provide the
guarantee? I thought 1 page would be enough. But you know the anti-frag
logic better ;). It would only save a couple of mbytes, so I doubt it
can make any real difference. Still, I prefer 1 if it's enough.
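[Editorial note] The three snippets quoted above can be combined into a small userspace sketch of how the recommended value is derived. Assumptions are loud here: the function names and parameters are hypothetical, the zone count of 3 and the 4 KiB page size are illustrative, and the second additive term (pageblocks per zone times MIGRATE_PCPTYPES squared) is not quoted anywhere in this thread. It is included on the assumption that it matches the kernel source of that period, because together with the quoted "* 2" term it reproduces the 67584 kB figure reported earlier for a 4G x86-64 system.

```c
#include <assert.h>

/* Assumed value: unmovable, reclaimable and movable migrate types. */
#define MIGRATE_PCPTYPES 3UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical userspace model of the recommended min_free_kbytes.
 * All sizes are in pages except the return value, which is in kB.
 */
static unsigned long recommended_min_free_kbytes(unsigned long pageblock_nr_pages,
						 unsigned long nr_zones,
						 unsigned long lowmem_pages,
						 unsigned long page_size_kb)
{
	unsigned long recommended_min;

	assert(page_size_kb > 0);

	/* Quoted in the thread: 2 hugepages per zone for MIGRATE_RESERVE. */
	recommended_min = pageblock_nr_pages * nr_zones * 2;

	/* Assumption, not quoted in the thread: extra pageblocks per type. */
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

	/* Quoted in the thread: never reserve more than 5% of lowmem. */
	recommended_min = min_ul(recommended_min, lowmem_pages / 20);

	return recommended_min * page_size_kb;
}

/* Quoted in the thread: THP stays off by default below 512M of RAM. */
static int thp_enabled_by_default(unsigned long totalram_pages,
				  unsigned int page_shift)
{
	assert(page_shift <= 20);
	return totalram_pages >= (512UL << (20 - page_shift));
}
```

With 512-page pageblocks and 3 zones, dropping the quoted multiplier from 2 to 1 would remove 1536 pages (6 MB with 4 KiB pages) from the total, which is consistent with the "couple of mbytes" estimate in the mail.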
* Re: too big min_free_kbytes 2011-02-22 14:42 ` Andrea Arcangeli @ 2011-02-22 14:50 ` Mel Gorman 2011-02-22 14:54 ` Andrea Arcangeli 2011-02-22 16:04 ` Mel Gorman 1 sibling, 1 reply; 52+ messages in thread From: Mel Gorman @ 2011-02-22 14:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 03:42:00PM +0100, Andrea Arcangeli wrote:
> <SNIP>
>
> I'm also intrigued by reducing this from 2 to 1:
>
> 	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
> 	recommended_min = pageblock_nr_pages * nr_zones * 2;
>
> Do we really need 2 pages instead of just 1 here to provide the
> guarantee?

For workloads that cause a lot of fragmentation - yes. Simplistically with 1,
the trace event mm_page_alloc_extfrag will trigger more frequently and
it's more likely to be severe. The problem is that if it's not "* 2",
there is a very low probability that there will be pages free in a suitable
pageblock and "mixing" occurs. It can take a very long time for
allocation success rates to go down but it happens eventually.

-- 
Mel Gorman
SUSE Labs
* Re: too big min_free_kbytes 2011-02-22 14:50 ` Mel Gorman @ 2011-02-22 14:54 ` Andrea Arcangeli 0 siblings, 0 replies; 52+ messages in thread From: Andrea Arcangeli @ 2011-02-22 14:54 UTC (permalink / raw) To: Mel Gorman Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 02:50:31PM +0000, Mel Gorman wrote:
> For workloads that cause a lot of fragmentation - yes. Simplistically with 1,
> the trace event mm_page_alloc_extfrag will trigger more frequently and
> it's more likely to be severe. The problem is that if it's not "* 2",
> there is a very low probability that there will pages free in a suitable
> pageblock and "mixing" occurs. It can take a very long time for
> allocation success rates to go down but it happens eventually.

Ok I see. Thanks for the clarification. So I think the other two spots
I quoted in my previous email are the only two bits we can adjust if
booting with madvise doesn't fix it completely. That is in addition to
the *8 removal in kswapd, which only affects ~4G systems; those are
very common, however, and this is an old bug that just got better
exposed by a higher min_free_kbytes default.
* Re: too big min_free_kbytes 2011-02-22 14:42 ` Andrea Arcangeli 2011-02-22 14:50 ` Mel Gorman @ 2011-02-22 16:04 ` Mel Gorman 2011-02-22 16:40 ` Rik van Riel 1 sibling, 1 reply; 52+ messages in thread From: Mel Gorman @ 2011-02-22 16:04 UTC (permalink / raw) To: Andrea Arcangeli Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, alex.shi

On Tue, Feb 22, 2011 at 03:42:00PM +0100, Andrea Arcangeli wrote:
> I suggest to boot with transparent_hugepage=madvise, or to set the
> default to madvise in make menuconfig. That will still enable the
> anti-frag logic in the buddy allocator in full. If the problem goes
> away with the madvise setting, then it's not related to
> min_free_kbytes. With the 700M fix for kswapd however it's hard to
> imagine the increase min_free_kbytes to cause out of memory conditions
> even if it uses a little more memory to allow for increased
> performance thanks to hugepages.
>

We didn't really agree on a fix though, did we? At least, I don't see
a patch we all agreed on in the thread. I stuck my ack on your patch
but Rik nak'd it because he wanted the balance gap to be preserved.
We had sort of agreed on a balance gap but didn't post a patch that
implemented it. AFAIK, an implementation of what was discussed is below.

If this is not the agreed fix, what is? If we agree on it, can Shaohua
confirm the fix works? This is against 2.6.38-rc6 which still isn't
fixed and I don't see a candidate fix in mmotm either.

==== CUT HERE ====
mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones

When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on all
zones if necessary and applies equal pressure on the inactive zone unless
a lot of pages are free already.

A "lot of free pages" is defined as a "balance gap" above the high watermark
which is currently 7*high_watermark. Historically this was reasonable as
min_free_kbytes was small. However, on systems using huge pages, it is
recommended that min_free_kbytes is higher and it is tuned with hugeadm
--set-recommended-min_free_kbytes. With the introduction of transparent
huge page support, this recommended value is also applied. On X86-64 with
4G of memory, min_free_kbytes becomes 67584 so one would expect around 68M
of memory to be free. The Normal zone is approximately 35000 pages so under
even normal memory pressure such as copying a large file, it gets exhausted
quickly. As it is getting exhausted, kswapd applies pressure equally to all
zones, including the DMA32 zone. DMA32 is approximately 700,000 pages with
a high watermark of around 23,000 pages. In this situation, kswapd will
reclaim around (23000*8 where 8 is the high watermark + balance gap of 7 *
high watermark) pages or 718M of pages before the zone is ignored. What
the user sees is that free memory is far higher than it should be.

To avoid an excessive number of pages being reclaimed from the larger zones,
explicitly define the "balance gap" to be either 1% of the zone or the
low watermark for the zone, whichever is smaller. While kswapd will check
all zones to apply pressure, it'll ignore zones that meet the (high_wmark +
balance_gap) watermark.

To test this, 80G were copied from a partition and the amount of memory
being used was recorded. A comparison of a patched and unpatched kernel
can be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/swap.h |    9 +++++++++
 mm/vmscan.c          |   16 +++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..a57c6e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -155,6 +155,15 @@ enum {
 #define SWAP_CLUSTER_MAX 32
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
+/*
+ * Ratio between the present memory in the zone and the "gap" that
+ * we're allowing kswapd to shrink in addition to the per-zone high
+ * wmark, even for zones that already have the high wmark satisfied,
+ * in order to provide better per-zone lru behavior. We are ok to
+ * spend not more than 1% of the memory for this zone balancing "gap".
+ */
+#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
+
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17497d0..0c83530 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2388,6 +2388,7 @@ loop_again:
 			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
+			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
 				continue;
@@ -2404,11 +2405,20 @@ loop_again:
 			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
 
 			/*
-			 * We put equal pressure on every zone, unless one
-			 * zone has way too many pages free already.
+			 * We put equal pressure on every zone, unless
+			 * one zone has way too many pages free
+			 * already. The "too many pages" is defined
+			 * as the high wmark plus a "gap" where the
+			 * gap is either the low watermark or 1%
+			 * of the zone, whichever is smaller.
 			 */
+			balance_gap = min(low_wmark_pages(zone),
+				(zone->present_pages +
+					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+				KSWAPD_ZONE_BALANCE_GAP_RATIO);
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone) + balance_gap,
+					end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
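[Editorial note] The balance-gap rule in the patch above can be tried out in userspace. This is a minimal sketch that mirrors the arithmetic of the hunk (round-up division by KSWAPD_ZONE_BALANCE_GAP_RATIO, then the smaller of that and the low watermark); the function names and the standalone form are hypothetical, and the sample numbers in the note below are the DMA32 values from the first mail in the thread, not measurements of the patched kernel.

```c
#include <assert.h>

/* Same value as the macro added to include/linux/swap.h by the patch. */
#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100UL

/*
 * Gap above the high watermark that kswapd may still reclaim:
 * the smaller of the low watermark and 1% of the zone (rounded up).
 */
static unsigned long balance_gap(unsigned long low_wmark_pages,
				 unsigned long present_pages)
{
	unsigned long one_percent =
		(present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
		KSWAPD_ZONE_BALANCE_GAP_RATIO;

	return low_wmark_pages < one_percent ? low_wmark_pages : one_percent;
}

/* Free-page level above which the zone is left alone by kswapd. */
static unsigned long kswapd_balance_target(unsigned long high_wmark_pages,
					   unsigned long low_wmark_pages,
					   unsigned long present_pages)
{
	return high_wmark_pages + balance_gap(low_wmark_pages, present_pages);
}
```

For a zone with 511496 present pages, low 13972 and high 16767, the gap is 5115 pages (1% of the zone, since that is smaller than the low watermark), so the target is 21882 pages instead of 8 * 16767 = 134136.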
* Re: too big min_free_kbytes 2011-02-22 16:04 ` Mel Gorman @ 2011-02-22 16:40 ` Rik van Riel 0 siblings, 0 replies; 52+ messages in thread From: Rik van Riel @ 2011-02-22 16:40 UTC (permalink / raw) To: Mel Gorman Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, alex.shi

On 02/22/2011 11:04 AM, Mel Gorman wrote:

> To avoid an excessive number of pages being reclaimed from the larger zones,
> explicitely defines the "balance gap" to be either 1% of the zone or the
> low watermark for the zone, whichever is smaller. While kswapd will check
> all zones to apply pressure, it'll ignore zones that meets the (high_wmark +
> balance_gap) watermark.
>
> To test this, 80G were copied from a partition and the amount of memory
> being used was recorded. A comparison of a patch and unpatched kernel
> can be seen at
> http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
> and shows that kswapd is not reclaiming as much memory with the patch
> applied.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

* Re: too big min_free_kbytes
  2011-02-22 14:25 ` Mel Gorman
  2011-02-22 14:42   ` Andrea Arcangeli
@ 2011-02-23 5:29   ` Shaohua Li
  2011-02-23 14:45     ` Andrea Arcangeli
  1 sibling, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-02-23 5:29 UTC
To: Mel Gorman
Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex

On Tue, 2011-02-22 at 22:25 +0800, Mel Gorman wrote:
> On Mon, Feb 14, 2011 at 10:25:24AM +0800, Shaohua Li wrote:
> > On Thu, Feb 03, 2011 at 10:58:08AM +0800, Andrea Arcangeli wrote:
> > > On Thu, Jan 27, 2011 at 04:27:55PM +0100, Andrea Arcangeli wrote:
> > > > totally untested... I will test....
> > >
> > > The below patch is fixing my problem and working fine for me... as
> > > expected it can't possibly lead to any D state, it's pretty much like
> > > setting min_free_kbytes lower, and it's not going to alter anything
> > > other than the levels of free memory kept by kswapd.
> > >
> > > $ while :; do ps xa|grep [k]swapd; sleep 1; done
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > >   452 ?        S      1:20 [kswapd0]
> > >   452 ?        R      1:20 [kswapd0]
> > > $ vmstat 1
> > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > >  r  b   swpd   free    buff  cache   si   so     bi    bo   in   cs us sy id wa
> > >  2  1   1784 111040 2393336 807924    0    0     63   992   56   70  1  1 96  2
> > >  0  1   1784 108928 2402556 801864    0    0 122624     0 1619 2150  0  5 80 16
> > >  0  1   1784 110664 2401244 801140    0    0 122496     0 1602 2081  0  3 81 16
> > >  0  1   1784 109796 2410184 792984    0    0 122752     0 1685 2149  0  4 80 16
> > >  0  1   1784 110416 2411856 791208    0    0 120448     4 1599 2075  0  4 81 16
> > >  1  0   1784 113516 2415344 785336    0    0 122496     0 1636 2125  0  4 81 15
> > >
> > > I doubt we'll get any regression because of the below (see also my
> > > prev email in this thread), and I would only expect more cache and
> > > maybe better lru. Previously the free memory levels were stuck at
> > > ~700M now they're stuck at the right level for a 4G system with THP on
> > > (I'd still like to try to reduce the requirements to only 1 hugepage for
> > > each migratetype in the set_min_free_kbytes to reduce the requirements
> > > to the minimum, but only if possible..). But this saves 600M over 4G so
> > > it's the highest prio to address.
> > Sorry for the late response, I was offline for several weeks.
> > The patch is addressing the 8*high_wmark issue, which isn't the original issue
> > I reported (sure, the 8*wmark issue should be fixed too).
> > min_free_kbytes is set higher and causes more pages to be freed even without
> > the 8*wmark issue. wmark:
> > before: min 1424
> > after: min 11178
>
> The higher min_free_kbytes is expected as a result of using transparent
> hugepages so I don't really consider it a bug. Free memory going up to
> about 700M as a result of kswapd is a real bug though.
>
> > in our test, there is about 50M memory free (originally just about 5M), which
> > will cause more swap. Should we also reduce the min_free_kbytes?
> >
>
> Either that or boot with transparent hugepages disabled and
> min_free_kbytes will be lower.

Fixing it will let more people enable THP by default. But anyway, we will
disable it for now if the issue can't be fixed.
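
The before/after zoneinfo figures quoted in this thread (min 1424, low 1780, high 2136 and min 11178, low 13972, high 16767) are consistent with low and high being derived from min as min + min/4 and min + min/2. The sketch below assumes that relationship purely from those numbers; `zone_wmarks_from_min` and `pages_to_kib` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>

struct zone_wmarks {
    unsigned long min;
    unsigned long low;
    unsigned long high;
};

/* Hypothetical helper: derive low/high from min (assumed 25% and 50% above). */
static struct zone_wmarks zone_wmarks_from_min(unsigned long min_pages)
{
    struct zone_wmarks w;

    w.min = min_pages;
    w.low = min_pages + (min_pages >> 2);   /* min + min/4 */
    w.high = min_pages + (min_pages >> 1);  /* min + min/2 */
    return w;
}

/* Hypothetical helper: convert a page count to KiB for a given page size. */
static unsigned long pages_to_kib(unsigned long pages, unsigned long page_size)
{
    assert(page_size >= 1024 && page_size % 1024 == 0);
    return pages * (page_size / 1024);
}
```

Assuming 4 KiB pages, a min of 11178 pages is 44712 KiB (about 44 MiB) and the matching high watermark of 16767 pages is 67068 KiB (about 65 MiB), which is in the same range as the "about 50M" of free memory reported above.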

* Re: too big min_free_kbytes
  2011-02-23 5:29 ` Shaohua Li
@ 2011-02-23 14:45 ` Andrea Arcangeli
  2011-02-24 8:08   ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-23 14:45 UTC
To: Shaohua Li
Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex

On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> Fixing it will let more people enable THP by default. But anyway, we will
> disable it for now if the issue can't be fixed.

Did you try what happens with transparent_hugepage=madvise? If that
doesn't fix it, it's a min_free_kbytes issue.

Also, if you're using a heavily threaded application, decreasing the
stack size with pthread_attr_setstack to something like 16k will fix
it.

* Re: too big min_free_kbytes
  2011-02-23 14:45 ` Andrea Arcangeli
@ 2011-02-24 8:08  ` Shaohua Li
  2011-02-24 9:52    ` Mel Gorman
  2011-02-24 14:04   ` Andrea Arcangeli
  0 siblings, 2 replies; 52+ messages in thread
From: Shaohua Li @ 2011-02-24 8:08 UTC
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Wed, 2011-02-23 at 22:45 +0800, Andrea Arcangeli wrote:
> On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> > Fixing it will let more people enable THP by default. But anyway, we will
> > disable it for now if the issue can't be fixed.
>
> Did you try what happens with transparent_hugepage=madvise? If that
> doesn't fix it, it's a min_free_kbytes issue.

With madvise, min_free_kbytes is still high (same as the 'always'
case). The result is that about 50M of memory is still reserved. You can
try it on your machine with the boot option 'mem=2G' and check the zoneinfo
output.

Thanks,
Shaohua

* Re: too big min_free_kbytes
  2011-02-24 8:08 ` Shaohua Li
@ 2011-02-24 9:52 ` Mel Gorman
  2011-02-24 9:57   ` Mel Gorman
  2011-02-24 14:04 ` Andrea Arcangeli
  1 sibling, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-24 9:52 UTC
To: Shaohua Li
Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> On Wed, 2011-02-23 at 22:45 +0800, Andrea Arcangeli wrote:
> > On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> > > Fixing it will let more people enable THP by default. but anyway we will
> > > disable it now if the issue can't be fixed.
> >
> > Did you try what happens with transparent_hugepage=madvise? If that
> > doesn't fix it, it's min_free_kbytes issue.
> with madvise, the min_free_kbytes is still high (same as the 'always'
> case).

This high min_free_kbytes is expected and is not considered a bug as it's
related to transparent hugepages being able to allocate huge pages for a
long period of time. Essentially, it's a cost of using hugepages.

> The result is still we have about 50M memory is reserved. you can
> try at your machine with boot option 'mem=2G' and check the zoneinfo
> output.
>

Is the actual free memory around the 50M mark or is it far higher than
it should be?

--
Mel Gorman
SUSE Labs

* Re: too big min_free_kbytes
  2011-02-24 9:52 ` Mel Gorman
@ 2011-02-24 9:57 ` Mel Gorman
  2011-02-24 14:27  ` Andrea Arcangeli
  0 siblings, 1 reply; 52+ messages in thread
From: Mel Gorman @ 2011-02-24 9:57 UTC
To: Shaohua Li
Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 09:52:09AM +0000, Mel Gorman wrote:
> On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> > On Wed, 2011-02-23 at 22:45 +0800, Andrea Arcangeli wrote:
> > > On Wed, Feb 23, 2011 at 01:29:14PM +0800, Shaohua Li wrote:
> > > > Fixing it will let more people enable THP by default. but anyway we will
> > > > disable it now if the issue can't be fixed.
> > >
> > > Did you try what happens with transparent_hugepage=madvise? If that
> > > doesn't fix it, it's min_free_kbytes issue.
> > with madvise, the min_free_kbytes is still high (same as the 'always'
> > case).
>
> This high min_free_kbytes is expected and is not considered a bug as it's
> related to transparent hugepages being able to allocate huge pages for a
> long period of time. Essentially, it's a cost of using hugepages.
>

I should be clearer here. madvise|always sets a high min_free_kbytes by
this check:

	if (ret > 0 &&
	    (test_bit(TRANSPARENT_HUGEPAGE_FLAG,
		      &transparent_hugepage_flags) ||
	     test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
		      &transparent_hugepage_flags)))
		set_recommended_min_free_kbytes();

so I'd expect the new higher value for min_free_kbytes once THP was ever
expected to be used.

If this new value was still considered a bug, removing the call to
set_recommended_min_free_kbytes() would always use the lower value that
was used in older kernels. This would "fix" the bug but transparent hugepage
users would not get the pages they expected the longer the system was running.
This would be harder for ordinary users to catch.

--
Mel Gorman
SUSE Labs
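
For readers wondering where the higher value comes from, the following is a user-space sketch of the kind of calculation set_recommended_min_free_kbytes() is understood to perform. It is reconstructed from memory of kernels of that era and should be treated as an assumption rather than the actual implementation: the function name `recommended_min_free_kbytes`, its parameters, and the 5% cap are all illustrative. The one cross-check available in this thread is that 3 zones, 512-page pageblocks, 3 migratetypes and 4 KiB pages reproduce the 67584 kB value Andrea reported on his 4G system.

```c
#include <assert.h>

/*
 * Hypothetical model: reserve enough free pageblocks per zone for
 * anti-fragmentation to work, capped at 5% of low memory.
 */
static unsigned long recommended_min_free_kbytes(unsigned long nr_zones,
                                                 unsigned long pageblock_pages,
                                                 unsigned long nr_migratetypes,
                                                 unsigned long lowmem_pages,
                                                 unsigned long page_size)
{
    unsigned long pages, cap;

    assert(page_size >= 1024 && page_size % 1024 == 0);

    /* Two pageblocks per zone for the migrate reserve. */
    pages = pageblock_pages * nr_zones * 2;

    /* One pageblock per (migratetype, fallback migratetype) pair per zone. */
    pages += pageblock_pages * nr_zones * nr_migratetypes * nr_migratetypes;

    /* Assumed cap: never reserve more than 5% of low memory. */
    cap = lowmem_pages / 20;
    if (pages > cap)
        pages = cap;

    /* Convert pages to kB. */
    return pages * (page_size / 1024);
}
```

Because the result scales with the number of populated zones rather than with total memory, a small machine pays proportionally more, which fits the complaint in this thread.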

* Re: too big min_free_kbytes
  2011-02-24 9:57 ` Mel Gorman
@ 2011-02-24 14:27 ` Andrea Arcangeli
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-24 14:27 UTC
To: Mel Gorman
Cc: Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 09:57:27AM +0000, Mel Gorman wrote:
> I should be clearer here. madvise|always sets a high min_free_kbytes by
> this check
>
> 	if (ret > 0 &&
> 	    (test_bit(TRANSPARENT_HUGEPAGE_FLAG,
> 		      &transparent_hugepage_flags) ||
> 	     test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> 		      &transparent_hugepage_flags)))
> 		set_recommended_min_free_kbytes();
>
> so I'd expect the new higher value for min_free_kbytes once THP was ever
> expected to be used.
>
> If this new value was still considered a bug, removing the call to
> set_recommended_min_free_kbytes() would always use the lower value that
> was used in older kernels. This would "fix" the bug but transparent hugepage
> users would not get the pages they expected the longer the system was running.
> This would be harder for ordinary users to catch.

This is a safe default for TRANSPARENT_HUGEPAGE_FLAG. All servers will
want set_recommended_min_free_kbytes. All we can argue about is whether
the TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG setting needs this or not (maybe
we can remove the TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG check, considering
madvise is mostly for embedded systems that can't waste a byte in case
THP increases the memory footprint of the program, but that still want
to use THP for embedded virt or similar usages that don't waste any
memory at peak load).

* Re: too big min_free_kbytes
  2011-02-24 8:08 ` Shaohua Li
  2011-02-24 9:52 ` Mel Gorman
@ 2011-02-24 14:04 ` Andrea Arcangeli
  2011-02-25 0:51   ` Shaohua Li
  1 sibling, 1 reply; 52+ messages in thread
From: Andrea Arcangeli @ 2011-02-24 14:04 UTC
To: Shaohua Li
Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> with madvise, the min_free_kbytes is still high (same as the 'always'
> case). The result is still we have about 50M memory is reserved. you can
> try at your machine with boot option 'mem=2G' and check the zoneinfo
> output.

Yes, I know. The objective of that test was exactly to find out whether
the problem is a higher memory footprint because of THP, or only the
anti-frag/min_free_kbytes setting, which would still be present with the
"madvise" setting (anti-frag is only shut down by the "never"
setting). If you still run out of memory with madvise, then you
can keep THP enabled "always" and then "echo 16384 >
/proc/sys/vm/min_free_kbytes". It should then work fine even in THP
always mode, with no need to disable THP (you simply won't have a good
guarantee that anti-frag is functional, so hugepage usage will be
reduced over time compared to the default min_free_kbytes that enables
anti-frag fully).

* Re: too big min_free_kbytes
  2011-02-24 14:04 ` Andrea Arcangeli
@ 2011-02-25 0:51  ` Shaohua Li
  2011-02-25 12:13   ` Mel Gorman
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-02-25 0:51 UTC
To: Andrea Arcangeli
Cc: Mel Gorman, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Thu, 2011-02-24 at 22:04 +0800, Andrea Arcangeli wrote:
> On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> > with madvise, the min_free_kbytes is still high (same as the 'always'
> > case). The result is still we have about 50M memory is reserved. you can
> > try at your machine with boot option 'mem=2G' and check the zoneinfo
> > output.
>
> yes I know. The objective of that test was exactly to know if the
> problem is higher memory footprint because of THP or only the
> anti-frag/min_free_kbytes which would still be present with the
> "madvise" setting (anti-frag is only shutdown by the "never"
> setting). If you still have the out of memory with madvise, then you
> can keep THP enabled "always" and then "echo 16384 >
> /proc/sys/vm/min_free_kbytes", it should work fine then even with THP
> always mode then, no need to disable THP (simply you won't have a good
> guarantee that anti-frag is functional so the hugepage usage will be
> reduced over time compared to the default min_free_kbytes that enables
> anti-frag fully).

I can disable THP or set min_free_kbytes manually in our test, but I
just wonder whether it's possible to avoid the memory waste even with THP
enabled, because this would make more people enable it by default. If you
don't consider this a problem, we can disable THP.

* Re: too big min_free_kbytes
  2011-02-25 0:51 ` Shaohua Li
@ 2011-02-25 12:13 ` Mel Gorman
  0 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2011-02-25 12:13 UTC
To: Shaohua Li
Cc: Andrea Arcangeli, Andrew Morton, linux-mm, Chen, Tim C, Rik van Riel, Shi, Alex, Andi Kleen

On Fri, Feb 25, 2011 at 08:51:49AM +0800, Shaohua Li wrote:
> On Thu, 2011-02-24 at 22:04 +0800, Andrea Arcangeli wrote:
> > On Thu, Feb 24, 2011 at 04:08:47PM +0800, Shaohua Li wrote:
> > > with madvise, the min_free_kbytes is still high (same as the 'always'
> > > case). The result is still we have about 50M memory is reserved. you can
> > > try at your machine with boot option 'mem=2G' and check the zoneinfo
> > > output.
> >
> > yes I know. The objective of that test was exactly to know if the
> > problem is higher memory footprint because of THP or only the
> > anti-frag/min_free_kbytes which would still be present with the
> > "madvise" setting (anti-frag is only shutdown by the "never"
> > setting). If you still have the out of memory with madvise, then you
> > can keep THP enabled "always" and then "echo 16384 >
> > /proc/sys/vm/min_free_kbytes", it should work fine then even with THP
> > always mode then, no need to disable THP (simply you won't have a good
> > guarantee that anti-frag is functional so the hugepage usage will be
> > reduced over time compared to the default min_free_kbytes that enables
> > anti-frag fully).
>
> I can disable THP or set the min_free_kbytes manually in our test, but
> just wonder if it's possible we can avoid the memory waste even with THP
> enabled, because this will make more people enable it by default.

With a lower value of min_free_kbytes, THP would give diminishing
returns over time as hugepage allocation success rates start degrading
over time. It might not happen for several days or weeks making it a
tricky problem to diagnose. So yes, the memory waste with THP enabled can
be fixed but it would only be suitable for short-term benchmarks.

> If you
> don't consider this is a problem, we can disable THP.
>

--
Mel Gorman
SUSE Labs

* Re: too big min_free_kbytes
  2011-01-27 15:27 ` Andrea Arcangeli
  2011-01-27 16:03   ` Mel Gorman
  2011-02-03 2:58    ` Andrea Arcangeli
@ 2011-02-12 9:48    ` alex shi
  2011-02-22 14:24     ` Mel Gorman
  2 siblings, 1 reply; 52+ messages in thread
From: alex shi @ 2011-02-12 9:48 UTC
To: Andrea Arcangeli
Cc: Mel Gorman, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, alex.shi

I tried the patch, but it seems to have no effect on our regression.

Regards
Alex

On Thu, Jan 27, 2011 at 11:27 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Jan 27, 2011 at 01:40:58PM +0000, Mel Gorman wrote:
> > On Wed, Jan 26, 2011 at 05:42:37PM +0000, Mel Gorman wrote:
> > > On Wed, Jan 26, 2011 at 04:36:55PM +0000, Mel Gorman wrote:
> > > > > But the wmarks don't
> > > > > seem the real offender, maybe it's something related to the tiny pci32
> > > > > zone that materialize on 4g systems that relocate some little memory
> > > > > over 4g to make space for the pci32 mmio. I didn't yet finish to debug
> > > > > it.
> > > > >
> > > >
> > > > This has to be it. What I think is happening is that we're in balance_pgdat(),
> > > > the "Normal" zone is never hitting the watermark and we constantly call
> > > > "goto loop_again" trying to "rebalance" all zones.
> > > >
> > >
> > > Confirmed.
> > > <SNIP>
> >
> > How about the following? Functionally it would work but I am concerned
> > that the logic in balance_pgdat() and kswapd() is getting out of hand
> > having being adjusted to work with a number of corner cases already. In
> > the next cycle, it could do with a "do-over" attempt to make it easier
> > to follow.
>
> That number 8 is the problem, I don't think anybody was ever supposed
> to free 8*highwmark pages. kswapd must work in the hysteresis range
> low->high area and then sleep, waiting for low to be hit again before it
> gets woken up. Not sure how that number 8 ever came up... but to me it
> looks like the real offender and I wouldn't work around it.
>
> totally untested... I will test....
>
> ====
> Subject: vmscan: kswapd must not free more than high_wmark pages
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> When the min_free_kbytes is set with "hugeadm
> --set-recommended-min_free_kbytes" or with THP enabled (which runs the
> equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate
> anti-frag at full effectiveness automatically at boot) the high wmark
> of some zone is as high as ~88M. 88M free on a 4G system isn't
> horrible, but 88M*8 = 704M free on a 4G system is definitely
> unbearable. This only tends to be visible on 4G systems with tiny
> over-4g zone where kswapd insists to reach the high wmark on the
> over-4g zone but doing so it shrunk up to 704M from the normal zone by
> mistake.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5d90de..9e3c78e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2407,7 +2407,7 @@ loop_again:
>  			 * zone has way too many pages free already.
>  			 */
>  			if (!zone_watermark_ok_safe(zone, order,
> -					8*high_wmark_pages(zone), end_zone, 0))
> +					high_wmark_pages(zone), end_zone, 0))
>  				shrink_zone(priority, zone, &sc);
>  			reclaim_state->reclaimed_slab = 0;
>  			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
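
The hysteresis Andrea describes in the quoted text (kswapd starts when free memory drops to the low watermark, reclaims until the high watermark is exceeded, then sleeps until low is hit again) can be pictured with a toy state function. This is a simplified user-space model under those assumptions, not the kernel's logic: `kswapd_should_reclaim` is a hypothetical name and the exact comparison operators are a guess.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: given whether kswapd is currently reclaiming and the
 * current free page count, return whether it should be reclaiming next.
 */
static bool kswapd_should_reclaim(bool reclaiming, unsigned long free_pages,
                                  unsigned long low_wmark,
                                  unsigned long high_wmark)
{
    assert(low_wmark <= high_wmark);

    if (free_pages <= low_wmark)
        return true;     /* at or below low: wake up and reclaim */
    if (free_pages > high_wmark)
        return false;    /* above high: zone is balanced, go to sleep */
    return reclaiming;   /* between low and high: keep the current state */
}
```

In this model, replacing high_wmark with 8 * high_wmark as the stop condition keeps kswapd reclaiming long after the zone is comfortably above its high watermark, which is the behaviour the quoted patch removes.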

* Re: too big min_free_kbytes
  2011-02-12 9:48 ` alex shi
@ 2011-02-22 14:24 ` Mel Gorman
  0 siblings, 0 replies; 52+ messages in thread
From: Mel Gorman @ 2011-02-22 14:24 UTC
To: alex shi
Cc: Andrea Arcangeli, Shaohua Li, Andrew Morton, linux-mm, Chen, Tim C, alex.shi

On Sat, Feb 12, 2011 at 05:48:55PM +0800, alex shi wrote:
> I tried the patch, but it seems to have no effect on our regression.
>

What is the nature of your regression? I see no details of it in the thread.

--
Mel Gorman
SUSE Labs

end of thread, other threads:[~2011-02-25 12:14 UTC | newest]

Thread overview: 52+ messages
2011-01-24 3:56 too big min_free_kbytes Shaohua Li
2011-01-24 15:00 ` Andrea Arcangeli
2011-01-25 14:35 ` Mel Gorman
2011-01-26 14:17 ` Mel Gorman
2011-01-26 15:23 ` Mel Gorman
2011-01-26 15:42 ` Andrea Arcangeli
2011-01-26 16:36 ` Mel Gorman
2011-01-26 17:42 ` Mel Gorman
2011-01-27 13:40 ` Mel Gorman
2011-01-27 15:27 ` Andrea Arcangeli
2011-01-27 16:03 ` Mel Gorman
2011-01-27 18:52 ` Andrea Arcangeli
2011-01-27 20:33 ` Rik van Riel
2011-01-27 21:31 ` Mel Gorman
2011-01-27 23:18 ` Rik van Riel
2011-01-28 10:35 ` Mel Gorman
2011-01-28 16:28 ` Andrea Arcangeli
2011-01-28 16:46 ` Mel Gorman
2011-01-28 17:16 ` Rik van Riel
2011-01-28 17:46 ` Andrea Arcangeli
2011-01-28 18:03 ` Rik van Riel
2011-01-28 18:24 ` Andrea Arcangeli
2011-01-28 19:34 ` Rik van Riel
2011-01-28 19:45 ` Andrea Arcangeli
2011-01-28 20:55 ` Rik van Riel
2011-01-29 19:45 ` Andrea Arcangeli
2011-01-28 17:34 ` Andrea Arcangeli
2011-01-28 17:10 ` Rik van Riel
2011-02-03 2:58 ` Andrea Arcangeli
2011-02-03 13:15 ` Mel Gorman
2011-02-03 18:59 ` Andrea Arcangeli
2011-02-03 14:36 ` Rik van Riel
2011-02-03 19:11 ` Andrea Arcangeli
2011-02-12 1:28 ` Simon Kirby
2011-02-14 2:25 ` Shaohua Li
2011-02-22 14:25 ` Mel Gorman
2011-02-22 14:42 ` Andrea Arcangeli
2011-02-22 14:50 ` Mel Gorman
2011-02-22 14:54 ` Andrea Arcangeli
2011-02-22 16:04 ` Mel Gorman
2011-02-22 16:40 ` Rik van Riel
2011-02-23 5:29 ` Shaohua Li
2011-02-23 14:45 ` Andrea Arcangeli
2011-02-24 8:08 ` Shaohua Li
2011-02-24 9:52 ` Mel Gorman
2011-02-24 9:57 ` Mel Gorman
2011-02-24 14:27 ` Andrea Arcangeli
2011-02-24 14:04 ` Andrea Arcangeli
2011-02-25 0:51 ` Shaohua Li
2011-02-25 12:13 ` Mel Gorman
2011-02-12 9:48 ` alex shi
2011-02-22 14:24 ` Mel Gorman