Both Joonsoo Kim and Minchan Kim have reported premature OOM kills. The
common element is zone-constrained allocation failures. Two factors appear
to be at fault -- pgdat being considered unreclaimable prematurely and
insufficient rotation of the active list.

The series is in three basic parts.

Patches 1-3 add per-zone stats back in. The actual stats patch is different
to Minchan's as the original patch did not account for the unevictable LRU,
which would corrupt counters. The second two patches remove approximations
based on pgdat statistics. It's effectively a revert of "mm, vmstat: remove
zone and node double accounting by approximating retries" but different LRU
stats are used. This is better than a full revert or a reworking of the
series as it preserves the history of why the zone stats are necessary. If
this works out, we may have to leave the double accounting in place for now
until an alternative cheap solution presents itself.

Patch 4 rotates inactive/active lists for lowmem allocations. This is also
quite different to Minchan's patch as the original patch did not account
for memcg and would rotate if *any* eligible zone needed rotation, which
may rotate excessively. The new patch considers the ratio for all eligible
zones, which is more in line with node-lru in general.

Patch 5 accounts for skipped pages as partially scanned. This avoids the
pgdat being prematurely marked unreclaimable while still allowing it to be
marked unreclaimable if there are no reclaimable pages.

These patches did not OOM for me on a 2G 32-bit KVM instance while running
a stress test for an hour. Preliminary tests on a 64-bit system using a
parallel dd workload did not show anything alarming. If an OOM is detected
then please post the full OOM message. Optionally, please also test without
patch 5 if an OOM occurs.

 include/linux/mm_inline.h | 19 ++---------
 include/linux/mmzone.h    |  7 ++++
 include/linux/swap.h      |  1 +
 mm/compaction.c           | 20 +----------
 mm/migrate.c              |  2 ++
 mm/page-writeback.c       | 17 +++++-----
 mm/page_alloc.c           | 59 +++++++++++----------------------
 mm/vmscan.c               | 84 ++++++++++++++++++++++++++++++++++++++---------
 mm/vmstat.c               |  6 ++++
 9 files changed, 116 insertions(+), 99 deletions(-)

-- 
2.6.4
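As an aside, patch 5's diff is not quoted in this posting, so here is a
minimal user-space sketch of the partial-scan idea it implements. The
helper name and the scan-credit divisor are illustrative assumptions, not
values taken from the patch itself:

	/*
	 * Sketch only: pages skipped during isolation because they sit in
	 * zones above the request's classzone earn partial scan credit so
	 * that a lowmem-constrained reclaim still makes visible progress,
	 * while an empty LRU earns full credit so a pgdat with nothing
	 * reclaimable can still be marked unreclaimable.
	 */
	static unsigned long account_skipped(unsigned long nr_scanned,
					     unsigned long nr_skipped,
					     int lru_empty)
	{
		if (lru_empty)
			return nr_scanned + nr_skipped;	/* full credit */
		return nr_scanned + nr_skipped / 4;	/* assumed divisor */
	}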
From: Minchan Kim <minchan@kernel.org>

While running a stress test with hackbench, I got OOM messages frequently,
which never happened with zone-lru.

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
..
..
 [<c71a76e2>] __alloc_pages_nodemask+0xe52/0xe60
 [<c71f31dc>] ? new_slab+0x39c/0x3b0
 [<c71f31dc>] new_slab+0x39c/0x3b0
 [<c71f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c777e127>] ? _raw_spin_unlock_irq+0x27/0x60
 [<c70cebfc>] ? trace_hardirqs_on_caller+0xec/0x1b0
 [<c70a1506>] ? finish_task_switch+0xa6/0x220
 [<c7219ee0>] ? poll_select_copy_remaining+0x140/0x140
 [<c7201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c71f525c>] kmem_cache_alloc+0x22c/0x260
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c763e6fc>] __alloc_skb+0x3c/0x260
 [<c763eece>] alloc_skb_with_frags+0x4e/0x1a0
 [<c7638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
 [<c770b581>] ? wait_for_unix_gc+0x31/0x90
 [<c71cfb1d>] ? alloc_set_pte+0x2ad/0x310
 [<c77084dd>] unix_stream_sendmsg+0x28d/0x340
 [<c7634dad>] sock_sendmsg+0x2d/0x40
 [<c7634e2c>] sock_write_iter+0x6c/0xc0
 [<c7204a90>] __vfs_write+0xc0/0x120
 [<c72053ab>] vfs_write+0x9b/0x1a0
 [<c71cc4a9>] ? __might_fault+0x49/0xa0
 [<c72062c4>] SyS_write+0x44/0x90
 [<c70036c6>] do_fast_syscall_32+0xa6/0x1e0
 [<c777ea2c>] sysenter_past_esp+0x45/0x74

Mem-Info:
 active_anon:104698 inactive_anon:105791 isolated_anon:192
 active_file:433 inactive_file:283 isolated_file:22
 unevictable:0 dirty:0 writeback:296 unstable:0
 slab_reclaimable:6389 slab_unreclaimable:78927
 mapped:474 shmem:0 pagetables:101426 bounce:0
 free:10518 free_pcp:334 free_cma:0
Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
25121 total pagecache pages
24160 pages in swap cache
Swap cache stats: add 86371, delete 62211, find 42865/60187
Free swap  = 4015560kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9658 pages reserved
0 pages cma reserved

The order-0 allocation for the normal zone failed while there was a lot of
reclaimable memory (i.e., anonymous memory with free swap). I wanted to
analyze the problem, but it was hard because the per-zone lru stats had
been removed, so I couldn't tell how much anonymous memory was in the
normal/dma zones. When we investigate an OOM problem, the reclaimable
memory count is a crucial stat for finding the cause. Without it, the OOM
message is hard to parse, so I believe we should keep it.

With per-zone lru stats:

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0

Mem-Info:
 active_anon:101103 inactive_anon:102219 isolated_anon:0
 active_file:503 inactive_file:544 isolated_file:0
 unevictable:0 dirty:0 writeback:34 unstable:0
 slab_reclaimable:6298 slab_unreclaimable:74669
 mapped:863 shmem:0 pagetables:100998 bounce:0
 free:23573 free_pcp:1861 free_cma:0
Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
54409 total pagecache pages
53215 pages in swap cache
Swap cache stats: add 300982, delete 247765, find 157978/226539
Free swap  = 3803244kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9642 pages reserved
0 pages cma reserved

With that, we can see the normal zone has about 86M of reclaimable memory,
so we can tell that something has gone wrong in reclaim (I will fix the
problem in the next patch).
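To make that figure concrete: with free swap available, anonymous memory
counts as reclaimable, so summing the new per-zone counters for the Normal
zone above gives

	active_anon + inactive_anon + active_file + inactive_file
	  = 86304kB + 0kB + 160kB + 376kB
	  = 86840kB, i.e. roughly 86M of reclaimable memory that the
	    old node-only output could not show.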
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h |  2 ++
 include/linux/mmzone.h    |  6 ++++++
 mm/page_alloc.c           | 10 ++++++++++
 mm/vmscan.c               |  9 ---------
 mm/vmstat.c               |  5 +++++
 5 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index bcc4ed07fa90..9cc130f5feb2 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -45,6 +45,8 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
+	__mod_zone_page_state(&pgdat->node_zones[zid],
+				NR_ZONE_LRU_BASE + lru, nr_pages);
 	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e6aca07cedb7..72625b04e9ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,6 +110,12 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
+	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
+	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
+	NR_ZONE_ACTIVE_ANON,
+	NR_ZONE_INACTIVE_FILE,
+	NR_ZONE_ACTIVE_FILE,
+	NR_ZONE_UNEVICTABLE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 830ad49a584a..b44c9a8d879a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4388,6 +4388,11 @@ void show_free_areas(unsigned int filter)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
+			" unevictable:%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4405,6 +4410,11 @@ void show_free_areas(unsigned int filter)
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
+			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 22aec2bcfeec..222d5403dd4b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,23 +1359,14 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 			enum lru_list lru, unsigned long *nr_zone_taken,
 			unsigned long nr_taken)
 {
-#ifdef CONFIG_HIGHMEM
 	int zid;
 
-	/*
-	 * Highmem has separate accounting for highmem pages so each zone
-	 * is updated separately.
-	 */
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		if (!nr_zone_taken[zid])
 			continue;
 
 		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
 	}
-#else
-	/* Zone ID does not matter on !HIGHMEM */
-	__update_lru_size(lruvec, lru, 0, -nr_taken);
-#endif
 
 #ifdef CONFIG_MEMCG
 	mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 91ecca96dcae..f10aad81a9a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
+	"nr_unevictable",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4
With the reintroduction of per-zone LRU stats, highmem_file_pages is
redundant, so remove it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h | 17 -----------------
 mm/page-writeback.c       | 12 ++++--------
 2 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9cc130f5feb2..71613e8a720f 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,22 +4,6 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
 
-#ifdef CONFIG_HIGHMEM
-extern atomic_t highmem_file_pages;
-
-static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
-							int nr_pages)
-{
-	if (is_highmem_idx(zid) && is_file_lru(lru))
-		atomic_add(nr_pages, &highmem_file_pages);
-}
-#else
-static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
-							int nr_pages)
-{
-}
-#endif
-
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -47,7 +31,6 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
-	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 573d138fa7a5..cfa78124c3c2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,17 +299,13 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 	return nr_pages;
 }
 
-#ifdef CONFIG_HIGHMEM
-atomic_t highmem_file_pages;
-#endif
 
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
 	int node;
-	unsigned long x;
+	unsigned long x = 0;
 	int i;
-	unsigned long dirtyable = 0;
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
 		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
@@ -326,12 +322,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 			nr_pages = zone_page_state(z, NR_FREE_PAGES);
 			/* watch for underflows */
 			nr_pages -= min(nr_pages, high_wmark_pages(z));
-			dirtyable += nr_pages;
+			nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
+			x += nr_pages;
 		}
 	}
 
-	x = dirtyable + atomic_read(&highmem_file_pages);
-
 	/*
 	 * Unreclaimable memory (kernel memory or anonymous memory
 	 * without swap) can bring down the dirtyable pages below
-- 
2.6.4
If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics. This is effectively a revert of "mm, vmstat: remove zone and
node double accounting by approximating retries" with the difference that
inactive/active stats are still available. This preserves the history of
why the approximation was tried and why it had to be reverted to handle
OOM kills on 32-bit systems.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  1 +
 include/linux/swap.h   |  1 +
 mm/compaction.c        | 20 +------------------
 mm/migrate.c           |  2 ++
 mm/page-writeback.c    |  5 +++++
 mm/page_alloc.c        | 49 ++++++++++---------------------------------------
 mm/vmscan.c            | 18 ++++++++++++++++++
 mm/vmstat.c            |  1 +
 8 files changed, 39 insertions(+), 58 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 72625b04e9ba..f2e4e90621ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index cc753c639e3d..b17cc4830fa6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,6 +307,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index cd93ea24c565..e5995f38d677 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1438,11 +1438,6 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
 	struct zone *zone;
 	struct zoneref *z;
-	pg_data_t *last_pgdat = NULL;
-
-	/* Do not retry compaction for zone-constrained allocations */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		return false;
 
 	/*
 	 * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1453,27 +1448,14 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		unsigned long available;
 		enum compact_result compact_result;
 
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-
-		/*
-		 * This over-estimates the number of pages available for
-		 * reclaim/compaction but walking the LRU would take too
-		 * long. The consequences are that compaction may retry
-		 * longer than it should for a zone-constrained allocation
-		 * request.
-		 */
-		last_pgdat = zone->zone_pgdat;
-		available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
-
 		/*
 		 * Do not consider all the reclaimable memory because we do not
 		 * want to trash just for a single high order allocation which
 		 * is even not guaranteed to appear even if __compaction_suitable
 		 * is happy about the watermark check.
 		 */
+		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
-		available = min(zone->managed_pages, available);
 		compact_result = __compaction_suitable(zone, order, alloc_flags,
 				ac_classzone_idx(ac), available);
 		if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index ed2f85e61de1..ed0268268e93 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,7 +513,9 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	}
 	if (dirty && mapping_cap_account_dirty(mapping)) {
 		__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+		__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
 		__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+		__inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cfa78124c3c2..7e9061ec040b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2462,6 +2462,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
+		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2483,6 +2484,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		dec_node_page_state(page, NR_FILE_DIRTY);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2739,6 +2741,7 @@ int clear_page_dirty_for_io(struct page *page)
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 			dec_node_page_state(page, NR_FILE_DIRTY);
+			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2785,6 +2788,7 @@ int test_clear_page_writeback(struct page *page)
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2839,6 +2843,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		inc_node_page_state(page, NR_WRITEBACK);
+		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b44c9a8d879a..afb254e22235 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3434,7 +3434,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
-	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3444,15 +3443,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
-	 * Blindly retry lowmem allocation requests that are often ignored by
-	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
-	 * and fast means of calculating reclaimable, dirty and writeback pages
-	 * in eligible zones.
-	 */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		goto out;
-
-	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3462,38 +3452,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
-		int zid;
 
-		if (current_pgdat == zone->zone_pgdat)
-			continue;
-
-		current_pgdat = zone->zone_pgdat;
-		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
+		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-
-		/* Account for all free pages on eligible zones */
-		for (zid = 0; zid <= zone_idx(zone); zid++) {
-			struct zone *acct_zone = &current_pgdat->node_zones[zid];
-
-			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
-		}
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
-		 * available? This is approximate because there is no
-		 * accurate count of reclaimable pages per zone.
+		 * available?
 		 */
-		for (zid = 0; zid <= zone_idx(zone); zid++) {
-			struct zone *check_zone = &current_pgdat->node_zones[zid];
-			unsigned long estimate;
-
-			estimate = min(check_zone->managed_pages, available);
-			if (!__zone_watermark_ok(check_zone, order,
-					min_wmark_pages(check_zone), ac_classzone_idx(ac),
-					alloc_flags, estimate))
-				continue;
-
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac_classzone_idx(ac), alloc_flags, available)) {
 			/*
 			 * If we didn't make any progress and have a lot of
 			 * dirty + writeback pages then we should wait for
@@ -3503,16 +3473,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			if (!did_some_progress) {
 				unsigned long write_pending;
 
-				write_pending =
-					node_page_state(current_pgdat, NR_WRITEBACK) +
-					node_page_state(current_pgdat, NR_FILE_DIRTY);
+				write_pending = zone_page_state_snapshot(zone,
+						NR_ZONE_WRITE_PENDING);
 
 				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
 			}
-out:
+
 			/*
 			 * Memory allocation/reclaim might be called from a WQ
 			 * context and the current implementation of the WQ
@@ -4393,6 +4362,7 @@ void show_free_areas(unsigned int filter)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
+			" writepending:%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4415,6 +4385,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
+			K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 222d5403dd4b..134381a20099 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,6 +194,24 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
+/*
+ * This misses isolated pages which are not accounted for to save counters.
+ * As the data only determines if reclaim or compaction continues, it is
+ * not expected that isolated pages will be a dominating factor.
+ */
+unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	unsigned long nr;
+
+	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
+		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
+			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
+
+	return nr;
+}
+
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f10aad81a9a3..e1a46906c61b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,7 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
+	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4
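For reference, zone_reclaimable_pages() above feeds the retry decision in
should_reclaim_retry(). The following stand-alone user-space sketch shows
how that function's discounting of the reclaimable estimate guarantees
convergence to OOM; MAX_RECLAIM_RETRIES is 16 in mm/page_alloc.c, and the
starting value (86840kB from the patch 1 example, expressed as 4kB pages)
is just an example:

	#include <stdio.h>

	#define MAX_RECLAIM_RETRIES	16
	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	int main(void)
	{
		unsigned long reclaimable = 21710;	/* 86840kB / 4kB */
		int loops;

		for (loops = 1; loops <= MAX_RECLAIM_RETRIES; loops++) {
			unsigned long available = reclaimable;

			/* Same discount as should_reclaim_retry() applies */
			available -= DIV_ROUND_UP(loops * available,
						  MAX_RECLAIM_RETRIES);
			printf("no_progress_loops=%2d available=%lu\n",
					loops, available);
		}

		/*
		 * At loops == MAX_RECLAIM_RETRIES the discount equals the
		 * whole estimate, so the watermark check must eventually
		 * fail and the allocator declares OOM instead of looping
		 * forever.
		 */
		return 0;
	}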
From: Minchan Kim <minchan@kernel.org>

Minchan Kim reported that with per-zone lru state it was possible to
identify that a normal zone with 86M of anonymous pages could trigger OOM
with non-atomic order-0 allocations as all pages in the zone were in the
active list.

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
Call Trace:
 [<c51a76e2>] __alloc_pages_nodemask+0xe52/0xe60
 [<c51f31dc>] ? new_slab+0x39c/0x3b0
 [<c51f31dc>] new_slab+0x39c/0x3b0
 [<c51f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
 [<c563e6fc>] ? __alloc_skb+0x3c/0x260
 [<c50b8e93>] ? enqueue_task_fair+0x73/0xbf0
 [<c5219ee0>] ? poll_select_copy_remaining+0x140/0x140
 [<c5201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
 [<c563e6fc>] ? __alloc_skb+0x3c/0x260
 [<c51f525c>] kmem_cache_alloc+0x22c/0x260
 [<c563e6fc>] ? __alloc_skb+0x3c/0x260
 [<c563e6fc>] __alloc_skb+0x3c/0x260
 [<c563eece>] alloc_skb_with_frags+0x4e/0x1a0
 [<c5638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
 [<c570b581>] ? wait_for_unix_gc+0x31/0x90
 [<c57084dd>] unix_stream_sendmsg+0x28d/0x340
 [<c5634dad>] sock_sendmsg+0x2d/0x40
 [<c5634e2c>] sock_write_iter+0x6c/0xc0
 [<c5204a90>] __vfs_write+0xc0/0x120
 [<c52053ab>] vfs_write+0x9b/0x1a0
 [<c51cc4a9>] ? __might_fault+0x49/0xa0
 [<c52062c4>] SyS_write+0x44/0x90
 [<c50036c6>] do_fast_syscall_32+0xa6/0x1e0

Mem-Info:
 active_anon:101103 inactive_anon:102219 isolated_anon:0
 active_file:503 inactive_file:544 isolated_file:0
 unevictable:0 dirty:0 writeback:34 unstable:0
 slab_reclaimable:6298 slab_unreclaimable:74669
 mapped:863 shmem:0 pagetables:100998 bounce:0
 free:23573 free_pcp:1861 free_cma:0
Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
54409 total pagecache pages
53215 pages in swap cache
Swap cache stats: add 300982, delete 247765, find 157978/226539
Free swap  = 3803244kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9642 pages reserved
0 pages cma reserved

The problem is due to the active list deactivation logic in
inactive_list_is_low:

	Node 0 active_anon:404412kB inactive_anon:409040kB

IOW, (inactive_anon of the node * inactive_ratio > active_anon of the
node) due to the highmem anonymous stats, so the VM never deactivates the
normal zone's anonymous pages.

This patch is a modified version of Minchan's original solution. The
problem with the original patch was that it didn't take memcg into account
and any lowmem zone with an imbalanced list could force a rotation. In
this patch, a zone-constrained global reclaim will rotate the list if the
inactive/active ratio of all eligible zones needs to be corrected. It is
possible that pages from higher zones will initially be rotated
prematurely, but this is the safer choice to maintain overall LRU age.
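To see the failure mode numerically, using the stats above and the ratio
rules from the table in inactive_list_is_low() (an inactive_ratio of 1
applies when the LRU total is below 1GB):

	Node-level check (what the unpatched code sees):
	  inactive_anon 409040kB + active_anon 404412kB < 1GB -> ratio = 1
	  inactive (409040) * 1 >= active (404412) -> not low, no rotation

	Normal zone alone (what a lowmem request needs):
	  inactive_anon 0kB, active_anon 86304kB
	  inactive (0) * 1 < active (86304) -> inactive is low, the
	  active list should be rotated

The node-level numbers mask the zone-level imbalance, which is why the
check must subtract the stats of ineligible zones as the patch below does.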
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 134381a20099..6810d81f60c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1964,7 +1964,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
+static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
+						struct scan_control *sc)
 {
 	unsigned long inactive_ratio;
 	unsigned long inactive;
@@ -1981,6 +1982,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
 	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
 	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
+	/*
+	 * For global reclaim on zone-constrained allocations, it is necessary
+	 * to check if rotations are required for lowmem to be reclaimed. This
+	 * calculates the inactive/active pages available in eligible zones.
+	 */
+	if (global_reclaim(sc)) {
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+		int zid;
+
+		for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
+			struct zone *zone = &pgdat->node_zones[zid];
+			unsigned long inactive_zone, active_zone;
+
+			if (!populated_zone(zone))
+				continue;
+
+			inactive_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE));
+			active_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
+
+			inactive -= min(inactive, inactive_zone);
+			active -= min(active, active_zone);
+		}
+	}
+
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
@@ -1994,7 +2021,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, is_file_lru(lru)))
+		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -2125,7 +2152,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * lruvec even if it has plenty of old anonymous pages unless the
 	 * system is under heavy pressure.
 	 */
-	if (!inactive_list_is_low(lruvec, true) &&
+	if (!inactive_list_is_low(lruvec, true, sc) &&
 	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
@@ -2367,7 +2394,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
	 */
-	if (inactive_list_is_low(lruvec, false))
+	if (inactive_list_is_low(lruvec, false, sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc,
 					LRU_ACTIVE_ANON);
 
@@ -3020,7 +3047,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
-		if (inactive_list_is_low(lruvec, false))
+		if (inactive_list_is_low(lruvec, false, sc))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc,
 					   LRU_ACTIVE_ANON);
 
-- 
2.6.4
Page reclaim determines whether a pgdat is unreclaimable by examining how
many pages have been scanned since a page was freed and comparing that to
the LRU sizes. Skipped pages are not reclaim candidates but contribute to
scanned. This can prematurely mark a pgdat as unreclaimable and trigger
an OOM kill.

This patch accounts for skipped pages as a partial scan so that an
unreclaimable pgdat will still be marked as such but by scaling the cost
of a skip, it'll avoid the pgdat being marked prematurely.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6810d81f60c7..e5af357dd4ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
-					!list_empty(src); scan++) {
+					!list_empty(src);) {
 		struct page *page;
 
 		page = lru_to_page(src);
@@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			continue;
 		}
 
+		/*
+		 * Account for scanned and skipped separately to avoid the pgdat
+		 * being prematurely marked unreclaimable by pgdat_reclaimable.
+		 */
+		scan++;
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	 */
 	if (!list_empty(&pages_skipped)) {
 		int zid;
+		unsigned long total_skipped = 0;
 
-		list_splice(&pages_skipped, src);
 		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 			if (!nr_skipped[zid])
 				continue;
 
 			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
+			total_skipped += nr_skipped[zid];
 		}
+
+		/*
+		 * Account skipped pages as a partial scan as the pgdat may be
+		 * close to unreclaimable. If the LRU list is empty, account
+		 * skipped pages as a full scan.
+		 */
+		scan += list_empty(src) ? total_skipped : total_skipped >> 2;
+
+		list_splice(&pages_skipped, src);
 	}
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
-- 
2.6.4
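As a rough illustration of what charging a skip at a quarter of a scan
buys, the userspace sketch below counts how many isolation passes it
takes to cross the unreclaimable threshold under full-cost and
quarter-cost skip accounting. It assumes the era's pgdat_reclaimable()
heuristic of declaring a node unreclaimable once scanned exceeds six
times the reclaimable pages; the per-pass numbers are invented for
illustration only.

/*
 * Illustrative sketch: how quickly does skip accounting push a node
 * over the "unreclaimable" threshold? Assumes the pgdat_reclaimable()
 * heuristic of scanned > 6 * reclaimable; all numbers are made up.
 */
#include <stdio.h>

static unsigned long passes_until_unreclaimable(unsigned long reclaimable,
						unsigned long skipped_per_pass,
						int shift)
{
	unsigned long scanned = 0, passes = 0;

	while (scanned <= 6 * reclaimable) {
		scanned += skipped_per_pass >> shift;	/* cost of one pass */
		passes++;
	}
	return passes;
}

int main(void)
{
	unsigned long reclaimable = 25000;	/* lowmem LRU pages, say */
	unsigned long skipped = 3000;		/* highmem pages skipped per pass */

	printf("full cost:    %lu passes\n",
	       passes_until_unreclaimable(reclaimable, skipped, 0));
	printf("quarter cost: %lu passes\n",
	       passes_until_unreclaimable(reclaimable, skipped, 2));
	return 0;
}

With these invented numbers the threshold is crossed after 51 passes at
full cost but 201 passes at quarter cost, so a pgdat with genuinely
unreclaimable lowmem is still eventually marked as such, just not
prematurely.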
On Thu, Jul 21, 2016 at 03:11:00PM +0100, Mel Gorman wrote:
> @@ -1981,6 +1982,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
> 	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> 	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> 
> +	/*
> +	 * For global reclaim on zone-constrained allocations, it is necessary
> +	 * to check if rotations are required for lowmem to be reclaimed. This

s/rotation/deactivation/

> +	 * calculates the inactive/active pages available in eligible zones.
> +	 */
> +	if (global_reclaim(sc)) {
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +		int zid;
> +
> +		for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {

The emphasis on global vs. memcg reclaim is somewhat strange, because
this is only about excluding pages from the balancing math that will be
skipped. Memcg reclaim is never zone-restricted, but if it were, it
would make sense to exclude the skipped pages there as well.

Indeed, for memcg reclaim sc->reclaim_idx + 1 is always MAX_NR_ZONES,
and so the for loop alone will do the right thing.

Can you please drop the global_reclaim() branch, the sc function
parameter, and the "global reclaim" from the comment?

Thanks
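For reference, this is roughly what the hunk looks like with that
feedback applied -- a sketch derived from the patch above, not a posted
revision. The global_reclaim() branch is gone, and for memcg reclaim
the loop degenerates to a no-op because sc->reclaim_idx + 1 is already
MAX_NR_ZONES there:

	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	int zid;

	/*
	 * Exclude the LRU pages of ineligible zones from the balancing
	 * decision. For memcg reclaim, sc->reclaim_idx + 1 is always
	 * MAX_NR_ZONES, so the loop body never runs and the node-wide
	 * counters are used unchanged.
	 */
	for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];
		unsigned long inactive_zone, active_zone;

		if (!populated_zone(zone))
			continue;

		inactive_zone = zone_page_state(zone,
				NR_ZONE_LRU_BASE + (file * LRU_FILE));
		active_zone = zone_page_state(zone,
				NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);

		inactive -= min(inactive, inactive_zone);
		active -= min(active, active_zone);
	}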
On Thu, Jul 21, 2016 at 03:10:57PM +0100, Mel Gorman wrote:
> From: Minchan Kim <minchan@kernel.org>
>
> While I did stress test with hackbench, I got OOM message frequently which
> didn't ever happen in zone-lru.
>
> gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
> ..
> ..
> [<c71a76e2>] __alloc_pages_nodemask+0xe52/0xe60
> [<c71f31dc>] ? new_slab+0x39c/0x3b0
> [<c71f31dc>] new_slab+0x39c/0x3b0
> [<c71f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
> [<c763e6fc>] ? __alloc_skb+0x3c/0x260
> [<c777e127>] ? _raw_spin_unlock_irq+0x27/0x60
> [<c70cebfc>] ? trace_hardirqs_on_caller+0xec/0x1b0
> [<c70a1506>] ? finish_task_switch+0xa6/0x220
> [<c7219ee0>] ? poll_select_copy_remaining+0x140/0x140
> [<c7201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
> [<c763e6fc>] ? __alloc_skb+0x3c/0x260
> [<c71f525c>] kmem_cache_alloc+0x22c/0x260
> [<c763e6fc>] ? __alloc_skb+0x3c/0x260
> [<c763e6fc>] __alloc_skb+0x3c/0x260
> [<c763eece>] alloc_skb_with_frags+0x4e/0x1a0
> [<c7638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
> [<c770b581>] ? wait_for_unix_gc+0x31/0x90
> [<c71cfb1d>] ? alloc_set_pte+0x2ad/0x310
> [<c77084dd>] unix_stream_sendmsg+0x28d/0x340
> [<c7634dad>] sock_sendmsg+0x2d/0x40
> [<c7634e2c>] sock_write_iter+0x6c/0xc0
> [<c7204a90>] __vfs_write+0xc0/0x120
> [<c72053ab>] vfs_write+0x9b/0x1a0
> [<c71cc4a9>] ? __might_fault+0x49/0xa0
> [<c72062c4>] SyS_write+0x44/0x90
> [<c70036c6>] do_fast_syscall_32+0xa6/0x1e0
> [<c777ea2c>] sysenter_past_esp+0x45/0x74
>
> Mem-Info:
> active_anon:104698 inactive_anon:105791 isolated_anon:192
> active_file:433 inactive_file:283 isolated_file:22
> unevictable:0 dirty:0 writeback:296 unstable:0
> slab_reclaimable:6389 slab_unreclaimable:78927
> mapped:474 shmem:0 pagetables:101426 bounce:0
> free:10518 free_pcp:334 free_cma:0
> Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
> DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 809 1965 1965
> Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
> lowmem_reserve[]: 0 0 9247 9247
> HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
> Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
> HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 25121 total pagecache pages
> 24160 pages in swap cache
> Swap cache stats: add 86371, delete 62211, find 42865/60187
> Free swap = 4015560kB
> Total swap = 4192252kB
> 524186 pages RAM
> 295934 pages HighMem/MovableOnly
> 9658 pages reserved
> 0 pages cma reserved
>
> The order-0 allocation for normal zone failed while there are a lot of
> reclaimable memory(i.e., anonymous memory with free swap). I wanted to
> analyze the problem but it was hard because we removed per-zone lru stat
> so I couldn't know how many of anonymous memory there are in normal/dma zone.
>
> When we investigate OOM problem, reclaimable memory count is crucial stat
> to find a problem. Without it, it's hard to parse the OOM message so I
> believe we should keep it.
>
> With per-zone lru stat,
>
> gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
> Mem-Info:
> active_anon:101103 inactive_anon:102219 isolated_anon:0
> active_file:503 inactive_file:544 isolated_file:0
> unevictable:0 dirty:0 writeback:34 unstable:0
> slab_reclaimable:6298 slab_unreclaimable:74669
> mapped:863 shmem:0 pagetables:100998 bounce:0
> free:23573 free_pcp:1861 free_cma:0
> Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
> DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 809 1965 1965
> Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
> lowmem_reserve[]: 0 0 9247 9247
> HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
> Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
> HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 54409 total pagecache pages
> 53215 pages in swap cache
> Swap cache stats: add 300982, delete 247765, find 157978/226539
> Free swap = 3803244kB
> Total swap = 4192252kB
> 524186 pages RAM
> 295934 pages HighMem/MovableOnly
> 9642 pages reserved
> 0 pages cma reserved
>
> With that, we can see normal zone has a 86M reclaimable memory so we can
> know something goes wrong(I will fix the problem in next patch) in reclaim.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Yep, makes sense to retain that insight into zones.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Thu, Jul 21, 2016 at 03:10:58PM +0100, Mel Gorman wrote:
> With the reintroduction of per-zone LRU stats, highmem_file_pages is
> redundant so remove it.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Thu, Jul 21, 2016 at 03:10:59PM +0100, Mel Gorman wrote:
> If per-zone LRU accounting is available then there is no point
> approximating whether reclaim and compaction should retry based on pgdat
> statistics. This is effectively a revert of "mm, vmstat: remove zone and
> node double accounting by approximating retries" with the difference that
> inactive/active stats are still available. This preserves the history of
> why the approximation was retried and why it had to be reverted to handle
> OOM kills on 32-bit systems.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
I like this version of should_reclaim_retry() much better ;)
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
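To sketch why the zone-based version reads better: with per-zone LRU
counters available, a should_reclaim_retry()-style check can estimate
reclaimable memory for each eligible zone directly instead of
approximating it from node-wide statistics. The helper below is
illustrative only -- the name zone_retry_ok() and the exact backoff are
this sketch's inventions, not the patch, and the anon counters are
assumed to follow the NR_ZONE_* naming used elsewhere in the thread:

/*
 * Illustrative only: a zone-based retry estimate in the spirit of the
 * restored code. Not the literal patch.
 */
static bool zone_retry_ok(struct zone *zone, unsigned int order,
			  int classzone_idx, int no_progress_loops)
{
	unsigned long available;

	/* LRU pages in this zone that reclaim could still free */
	available  = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE);
	available += zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
	available += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON);
	available += zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);

	/* Back off the estimate as reclaim keeps failing to progress */
	available -= DIV_ROUND_UP(no_progress_loops * available,
				  MAX_RECLAIM_RETRIES);
	available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

	/* Would the watermark be met if all of that were freed? */
	return __zone_watermark_ok(zone, order, min_wmark_pages(zone),
				   classzone_idx, 0, available);
}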
On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that to
> the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> an OOM kill.
>
> This patch accounts for skipped pages as a partial scan so that an
> unreclaimable pgdat will still be marked as such but by scaling the cost
> of a skip, it'll avoid the pgdat being marked prematurely.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Thu, Jul 21, 2016 at 03:10:58PM +0100, Mel Gorman wrote:
> With the reintroduction of per-zone LRU stats, highmem_file_pages is
> redundant so remove it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mm_inline.h | 17 -----------------
>  mm/page-writeback.c       | 12 ++++--------
>  2 files changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 9cc130f5feb2..71613e8a720f 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -4,22 +4,6 @@
>  #include <linux/huge_mm.h>
>  #include <linux/swap.h>
>  
> -#ifdef CONFIG_HIGHMEM
> -extern atomic_t highmem_file_pages;
> -
> -static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> -							int nr_pages)
> -{
> -	if (is_highmem_idx(zid) && is_file_lru(lru))
> -		atomic_add(nr_pages, &highmem_file_pages);
> -}
> -#else
> -static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> -							int nr_pages)
> -{
> -}
> -#endif
> -
>  /**
>   * page_is_file_cache - should the page be on a file LRU or anon LRU?
>   * @page: the page to test
> @@ -47,7 +31,6 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
>  	__mod_zone_page_state(&pgdat->node_zones[zid],
>  				NR_ZONE_LRU_BASE + lru, nr_pages);
> -	acct_highmem_file_pages(zid, lru, nr_pages);
>  }
>  
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 573d138fa7a5..cfa78124c3c2 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -299,17 +299,13 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
>  
>  	return nr_pages;
>  }
> -#ifdef CONFIG_HIGHMEM
> -atomic_t highmem_file_pages;
> -#endif
>  
>  static unsigned long highmem_dirtyable_memory(unsigned long total)
>  {
>  #ifdef CONFIG_HIGHMEM
>  	int node;
> -	unsigned long x;
> +	unsigned long x = 0;
>  	int i;
> -	unsigned long dirtyable = 0;
>  
>  	for_each_node_state(node, N_HIGH_MEMORY) {
>  		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
> @@ -326,12 +322,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  			nr_pages = zone_page_state(z, NR_FREE_PAGES);
>  			/* watch for underflows */
>  			nr_pages -= min(nr_pages, high_wmark_pages(z));
> -			dirtyable += nr_pages;
> +			nr_pages += zone_page_state(z, NR_INACTIVE_FILE);

NR_ZONE_INACTIVE_FILE

> +			nr_pages += zone_page_state(z, NR_ACTIVE_FILE);

NR_ZONE_ACTIVE_FILE

> +			x += nr_pages;
On Thu, Jul 21, 2016 at 03:10:59PM +0100, Mel Gorman wrote:
> If per-zone LRU accounting is available then there is no point
> approximating whether reclaim and compaction should retry based on pgdat
> statistics. This is effectively a revert of "mm, vmstat: remove zone and
> node double accounting by approximating retries" with the difference that
> inactive/active stats are still available. This preserves the history of
> why the approximation was retried and why it had to be reverted to handle
> OOM kills on 32-bit systems.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that to
> the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> an OOM kill.
>
> This patch accounts for skipped pages as a partial scan so that an
> unreclaimable pgdat will still be marked as such but by scaling the cost
> of a skip, it'll avoid the pgdat being marked prematurely.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> mm/vmscan.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6810d81f60c7..e5af357dd4ac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> LIST_HEAD(pages_skipped);
>
> for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> - !list_empty(src); scan++) {
> + !list_empty(src);) {
> struct page *page;
>
> page = lru_to_page(src);
> @@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> continue;
> }
>
> + /*
> + * Account for scanned and skipped separetly to avoid the pgdat
> + * being prematurely marked unreclaimable by pgdat_reclaimable.
> + */
> + scan++;
> +
> switch (__isolate_lru_page(page, mode)) {
> case 0:
> nr_pages = hpage_nr_pages(page);
> @@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> */
> if (!list_empty(&pages_skipped)) {
> int zid;
> + unsigned long total_skipped = 0;
>
> - list_splice(&pages_skipped, src);
> for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> if (!nr_skipped[zid])
> continue;
>
> __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
> + total_skipped += nr_skipped[zid];
> }
> +
> + /*
> + * Account skipped pages as a partial scan as the pgdat may be
> + * close to unreclaimable. If the LRU list is empty, account
> + * skipped pages as a full scan.
> + */
node-lru made OOM detection lengthy because freeing a page in any zone
resets NR_PAGES_SCANNED, so it is hard to reach a situation where
pgdat_reclaimable returns *false*.
When I run a stress test, I seem to hit that lengthy detection easily,
although I have no numbers right now.
Anyway, this patch makes sense to me because it is better than the
current behaviour, and I support the idea of accounting skipped pages
as partial scans.
Still, I doubt it is okay to keep skipping pages under an
irq-disabled spinlock without any bound.
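One hypothetical way to bound that, sketched here only to make the
concern concrete: abandon isolation once skips dominate, so the
irq-disabled lru_lock hold time stays proportional to nr_to_scan. The
early exit, the total_skipped counter (which would be declared
alongside scan), and the factor of 8 are all inventions of this sketch,
not something posted in the thread; the skip branch is assumed to look
roughly as it does in the kernel at this point:

		if (page_zonenum(page) > sc->reclaim_idx) {
			list_move(&page->lru, &pages_skipped);
			nr_skipped[page_zonenum(page)]++;
			/*
			 * Hypothetical bound, not in the posted patch:
			 * give up once skipped pages dwarf the scan
			 * target so the irq-disabled lru_lock is not
			 * held across an unbounded walk of ineligible
			 * pages. The factor of 8 is arbitrary.
			 */
			if (++total_skipped > (nr_to_scan << 3))
				break;
			continue;
		}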
The wrong stat is being accumulated in highmem_dirtyable_memory; fix it.
This is a fix to the mmotm patch
mm-vmscan-remove-highmem_file_pages.patch.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page-writeback.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9061ec040b..f4cd7d8005c9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -322,8 +322,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 			nr_pages = zone_page_state(z, NR_FREE_PAGES);
 			/* watch for underflows */
 			nr_pages -= min(nr_pages, high_wmark_pages(z));
-			nr_pages += zone_page_state(z, NR_INACTIVE_FILE);
-			nr_pages += zone_page_state(z, NR_ACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
 			x += nr_pages;
 		}
 	}
On Mon, Jul 25, 2016 at 05:39:13PM +0900, Minchan Kim wrote:
> > @@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > */
> > if (!list_empty(&pages_skipped)) {
> > int zid;
> > + unsigned long total_skipped = 0;
> >
> > - list_splice(&pages_skipped, src);
> > for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > if (!nr_skipped[zid])
> > continue;
> >
> > __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
> > + total_skipped += nr_skipped[zid];
> > }
> > +
> > + /*
> > + * Account skipped pages as a partial scan as the pgdat may be
> > + * close to unreclaimable. If the LRU list is empty, account
> > + * skipped pages as a full scan.
> > + */
>
> node-lru made OOM detection lengthy because a freeing of any zone will
> reset NR_PAGES_SCANNED easily so that it's hard to meet a situation
> pgdat_reclaimable returns *false*.
>
Your patch should go a long way towards addressing that as it checks the
zone counters first before conducting the scan. Remember as well that the
longer OOM detection only applies to zone-constrained allocations, and
there is always the possibility that shrinking highmem pages frees
lowmem memory if buffers are in use.
--
Mel Gorman
SUSE Labs
On Thu, Jul 21, 2016 at 03:10:56PM +0100, Mel Gorman wrote:
> These patches did not OOM for me on a 2G 32-bit KVM instance while running
> a stress test for an hour. Preliminary tests on a 64-bit system using a
> parallel dd workload did not show anything alarming.
> 
> If an OOM is detected then please post the full OOM message.

Before attaching the OOM messages, I should note that my test case also
triggers OOM on the old kernel if there are four parallel file-readers.
With node-lru and patches 1~5, OOM is triggered even with a single
parallel file-reader. With node-lru and patches 1~4, OOM is triggered
with two or more parallel file-readers.

Here goes the OOM message.

fork invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
fork cpuset=/ mems_allowed=0
CPU: 0 PID: 4304 Comm: fork Not tainted 4.7.0-rc7-next-20160720+ #713
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff8800209ab960 ffffffff8142bd03 ffff8800209abb58
 ffff8800209a0000 ffff8800209ab9d8 ffffffff81241a59 ffffffff81e70020
 ffff8800209ab988 ffffffff810dddcd ffff8800209ab9a8 0000000000000206
Call Trace:
 [<ffffffff8142bd03>] dump_stack+0x85/0xc2
 [<ffffffff81241a59>] dump_header+0x5c/0x22e
 [<ffffffff810dddcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811b33e1>] oom_kill_process+0x221/0x3f0
 [<ffffffff811b3a22>] out_of_memory+0x422/0x560
 [<ffffffff811b9f69>] __alloc_pages_nodemask+0x1069/0x10c0
 [<ffffffff81211a41>] ? alloc_pages_vma+0xc1/0x300
 [<ffffffff81211a41>] alloc_pages_vma+0xc1/0x300
 [<ffffffff811e851f>] ? wp_page_copy+0x7f/0x640
 [<ffffffff811e851f>] wp_page_copy+0x7f/0x640
 [<ffffffff811e974b>] do_wp_page+0x13b/0x6e0
 [<ffffffff811ec704>] handle_mm_fault+0xaf4/0x1310
 [<ffffffff811ebc4b>] ? handle_mm_fault+0x3b/0x1310
 [<ffffffff8106eb90>] ? __do_page_fault+0x160/0x4e0
 [<ffffffff8106ec19>] __do_page_fault+0x1e9/0x4e0
 [<ffffffff8106efed>] trace_do_page_fault+0x5d/0x290
 [<ffffffff810674ca>] do_async_page_fault+0x1a/0xa0
 [<ffffffff8185bee8>] async_page_fault+0x28/0x30
 [<ffffffff810a73d3>] ? __task_pid_nr_ns+0xb3/0x1b0
 [<ffffffff8143ab9c>] ? __put_user_4+0x1c/0x30
 [<ffffffff810b7205>] ? schedule_tail+0x55/0x70
 [<ffffffff81859f3c>] ret_from_fork+0xc/0x40
Mem-Info:
active_anon:26762 inactive_anon:95 isolated_anon:0
 active_file:42543 inactive_file:347438 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:5476 slab_unreclaimable:23140
 mapped:389534 shmem:95 pagetables:20927 bounce:0
 free:6948 free_pcp:222 free_cma:0
Node 0 active_anon:107048kB inactive_anon:380kB active_file:170008kB inactive_file:1389752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1558136kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:4697206 all_unreclaimable? yes
Node 0 DMA free:2168kB min:204kB low:252kB high:300kB active_anon:3544kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2684kB kernel_stack:1760kB pagetables:3092kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 493 493 1955
Node 0 DMA32 free:6508kB min:6492kB low:8112kB high:9732kB active_anon:81264kB inactive_anon:0kB active_file:101204kB inactive_file:228kB unevictable:0kB writepending:0kB present:2080632kB managed:508584kB mlocked:0kB slab_reclaimable:21904kB slab_unreclaimable:89876kB kernel_stack:46400kB pagetables:80616kB bounce:0kB free_pcp:544kB local_pcp:120kB free_cma:0kB
lowmem_reserve[]: 0 0 0 1462
Node 0 Movable free:19116kB min:19256kB low:24068kB high:28880kB active_anon:22240kB inactive_anon:380kB active_file:68812kB inactive_file:1389688kB unevictable:0kB writepending:0kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:368kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 1*32kB (M) 1*64kB (U) 0*128kB 0*256kB 2*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 2168kB
Node 0 DMA32: 51*4kB (UME) 96*8kB (ME) 46*16kB (UME) 41*32kB (ME) 32*64kB (ME) 11*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6476kB
Node 0 Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 1*64kB (M) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 1*2048kB (M) 4*4096kB (M) = 19324kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
390134 total pagecache pages
0 pages in swap cache

> Optionally please test without patch 5 if an OOM occurs.

Here goes the OOM message without patch 5.

fork invoked oom-killer: gfp_mask=0x26000c0(GFP_KERNEL|__GFP_NOTRACK), order=0, oom_score_adj=0
fork cpuset=/ mems_allowed=0
CPU: 5 PID: 1269 Comm: fork Not tainted 4.7.0-rc7-next-20160720+ #714
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff8800136138e8 ffffffff8142bd23 ffff880013613ae0
 ffff88000fa6ca00 ffff880013613960 ffffffff81241a79 ffffffff81e70020
 ffff880013613910 ffffffff810dddcd ffff880013613930 0000000000000206
Call Trace:
 [<ffffffff8142bd23>] dump_stack+0x85/0xc2
 [<ffffffff81241a79>] dump_header+0x5c/0x22e
 [<ffffffff810dddcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811b33e1>] oom_kill_process+0x221/0x3f0
 [<ffffffff811b3a22>] out_of_memory+0x422/0x560
 [<ffffffff811b9f69>] __alloc_pages_nodemask+0x1069/0x10c0
 [<ffffffff8120fb01>] ? alloc_pages_current+0xa1/0x1f0
 [<ffffffff8120fb01>] alloc_pages_current+0xa1/0x1f0
 [<ffffffff81219f33>] ? new_slab+0x473/0x5e0
 [<ffffffff81219f33>] new_slab+0x473/0x5e0
 [<ffffffff8121b16f>] ___slab_alloc+0x27f/0x550
 [<ffffffff8121b491>] ? __slab_alloc+0x51/0x90
 [<ffffffff81081e11>] ? copy_process.part.29+0xc11/0x1b90
 [<ffffffff81081e11>] ? copy_process.part.29+0xc11/0x1b90
 [<ffffffff8121b491>] __slab_alloc+0x51/0x90
 [<ffffffff8121b6dc>] kmem_cache_alloc+0x20c/0x2b0
 [<ffffffff81081e11>] ? copy_process.part.29+0xc11/0x1b90
 [<ffffffff81081e11>] copy_process.part.29+0xc11/0x1b90
 [<ffffffff81082f86>] _do_fork+0xe6/0x6a0
 [<ffffffff810835e9>] SyS_clone+0x19/0x20
 [<ffffffff81003e13>] do_syscall_64+0x73/0x1e0
 [<ffffffff81859dc3>] entry_SYSCALL64_slow_path+0x25/0x25
Mem-Info:
active_anon:26003 inactive_anon:95 isolated_anon:0
 active_file:289026 inactive_file:96101 isolated_file:21
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:6056 slab_unreclaimable:23737
 mapped:384788 shmem:95 pagetables:23282 bounce:0
 free:7815 free_pcp:179 free_cma:0
Node 0 active_anon:104012kB inactive_anon:380kB active_file:1156104kB inactive_file:384404kB unevictable:0kB isolated(anon):0kB isolated(file):84kB mapped:1539152kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 2048kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:2512936 all_unreclaimable? yes
Node 0 DMA free:2172kB min:204kB low:252kB high:300kB active_anon:3204kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:16kB slab_unreclaimable:2944kB kernel_stack:1584kB pagetables:3188kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 493 493 1955
Node 0 DMA32 free:6320kB min:6492kB low:8112kB high:9732kB active_anon:79128kB inactive_anon:0kB active_file:69016kB inactive_file:15872kB unevictable:0kB writepending:0kB present:2080632kB managed:508584kB mlocked:0kB slab_reclaimable:24208kB slab_unreclaimable:92004kB kernel_stack:44064kB pagetables:89940kB bounce:0kB free_pcp:264kB local_pcp:100kB free_cma:0kB
lowmem_reserve[]: 0 0 0 1462
Node 0 Movable free:22768kB min:19256kB low:24068kB high:28880kB active_anon:21676kB inactive_anon:380kB active_file:1085592kB inactive_file:369724kB unevictable:0kB writepending:0kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:452kB local_pcp:80kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB (M) 0*8kB 1*16kB (M) 1*32kB (M) 1*64kB (M) 0*128kB 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 0*4096kB = 2172kB
Node 0 DMA32: 94*4kB (ME) 48*8kB (ME) 22*16kB (ME) 10*32kB (UME) 3*64kB (ME) 1*128kB (M) 0*256kB 2*512kB (UM) 4*1024kB (M) 0*2048kB 0*4096kB = 6872kB
Node 0 Movable: 0*4kB 0*8kB 1*16kB (M) 3*32kB (M) 4*64kB (M) 1*128kB (M) 10*256kB (M) 3*512kB (M) 0*1024kB 1*2048kB (M) 4*4096kB (M) = 23024kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
385234 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0

Thanks.
On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that to
> the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> an OOM kill.
>
> This patch accounts for skipped pages as a partial scan so that an
> unreclaimable pgdat will still be marked as such but by scaling the cost
> of a skip, it'll avoid the pgdat being marked prematurely.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6810d81f60c7..e5af357dd4ac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	LIST_HEAD(pages_skipped);
>
>  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> -					!list_empty(src); scan++) {
> +					!list_empty(src);) {
>  		struct page *page;
>
>  		page = lru_to_page(src);
> @@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>
> +		/*
> +		 * Account for scanned and skipped separately to avoid the pgdat
> +		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> +		 */
> +		scan++;
> +

This logic has a potential unbounded retry problem. src would not become
empty if __isolate_lru_page() returns -EBUSY, since we move the failed
page back to the src list in that case.

Thanks.

>  		switch (__isolate_lru_page(page, mode)) {
>  		case 0:
>  			nr_pages = hpage_nr_pages(page);
> @@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	 */
>  	if (!list_empty(&pages_skipped)) {
>  		int zid;
> +		unsigned long total_skipped = 0;
>
> -		list_splice(&pages_skipped, src);
>  		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  			if (!nr_skipped[zid])
>  				continue;
>
>  			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
> +			total_skipped += nr_skipped[zid];
>  		}
> +
> +		/*
> +		 * Account skipped pages as a partial scan as the pgdat may be
> +		 * close to unreclaimable. If the LRU list is empty, account
> +		 * skipped pages as a full scan.
> +		 */
> +		scan += list_empty(src) ? total_skipped : total_skipped >> 2;
> +
> +		list_splice(&pages_skipped, src);
>  	}
>  	*nr_scanned = scan;
>  	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
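To put rough numbers on the accounting change (an illustrative example, not
figures from this thread): if one isolate_lru_pages() call skips 128 pages
from ineligible zones while src still holds pages, the hunk above adds
128 >> 2 = 32 to scan rather than 128, so skips feed the pages_scanned
counter at a quarter of the previous rate. Only if the skips drain src
completely is the full 128 charged, which is what still allows a genuinely
unreclaimable pgdat to be marked as such.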
On Tue, Jul 26, 2016 at 05:16:22PM +0900, Joonsoo Kim wrote:
> On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> > Page reclaim determines whether a pgdat is unreclaimable by examining how
> > many pages have been scanned since a page was freed and comparing that to
> > the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> > scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> > an OOM kill.
> >
> > This patch accounts for skipped pages as a partial scan so that an
> > unreclaimable pgdat will still be marked as such but by scaling the cost
> > of a skip, it'll avoid the pgdat being marked prematurely.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> > mm/vmscan.c | 20 ++++++++++++++++++--
> > 1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6810d81f60c7..e5af357dd4ac 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  	LIST_HEAD(pages_skipped);
> >
> >  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> > -					!list_empty(src); scan++) {
> > +					!list_empty(src);) {
> >  		struct page *page;
> >
> >  		page = lru_to_page(src);
> > @@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  			continue;
> >  		}
> >
> > +		/*
> > +		 * Account for scanned and skipped separately to avoid the pgdat
> > +		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> > +		 */
> > +		scan++;
> > +
>
> This logic has a potential unbounded retry problem. src would not become
> empty if __isolate_lru_page() returns -EBUSY, since we move the failed
> page back to the src list in that case.
Oops, it would not retry unboundedly after all. It causes needless
retries, but bounded ones. Sorry for the noise.
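For the termination argument, here is a condensed sketch of the loop as it
stands with patch 5 applied (illustrative only; locking, the LRU size
updates and the other __isolate_lru_page() cases are omitted):

	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
					!list_empty(src);) {
		struct page *page = lru_to_page(src);

		if (page_zonenum(page) > sc->reclaim_idx) {
			/* Skipped pages leave src, so skips cannot loop. */
			list_move(&page->lru, &pages_skipped);
			continue;
		}

		/* Charged before the isolation attempt. */
		scan++;

		switch (__isolate_lru_page(page, mode)) {
		case 0:
			nr_taken += hpage_nr_pages(page);
			list_move(&page->lru, dst);
			break;
		case -EBUSY:
			/*
			 * The page goes back onto src, but scan has already
			 * been incremented, so at most nr_to_scan -EBUSY
			 * rounds are possible: needless retries, but bounded.
			 */
			list_move(&page->lru, src);
			break;
		}
	}

Every pass either removes a page from src or increments scan, so the loop
terminates within nr_to_scan isolation attempts plus the number of
ineligible pages on the list.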
Thanks.
On Tue, Jul 26, 2016 at 05:11:30PM +0900, Joonsoo Kim wrote:
> > These patches did not OOM for me on a 2G 32-bit KVM instance while running
> > a stress test for an hour. Preliminary tests on a 64-bit system using a
> > parallel dd workload did not show anything alarming.
> >
> > If an OOM is detected then please post the full OOM message.
>
> Before attaching the OOM message, I should note that my test case also
> triggers OOM on the old kernel if there are four parallel file-readers.
> With node-lru and patches 1-5, OOM is triggered even with a single
> parallel file-reader. With node-lru and patches 1-4, OOM is triggered
> with two or more parallel file-readers.
>

The key there is that patch 5 allows OOM to be detected more quickly. The
fork workload exits after some time so it's inherently a race to see if
the forked process exits before OOM is triggered or not.

> <SNIP>
> Mem-Info:
>  active_anon:26762 inactive_anon:95 isolated_anon:0
>  active_file:42543 inactive_file:347438 isolated_file:0
>  unevictable:0 dirty:0 writeback:0 unstable:0
>  slab_reclaimable:5476 slab_unreclaimable:23140
>  mapped:389534 shmem:95 pagetables:20927 bounce:0
>  free:6948 free_pcp:222 free_cma:0
> Node 0 active_anon:107048kB inactive_anon:380kB active_file:170008kB inactive_file:1389752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1558136kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:4697206 all_unreclaimable? yes
> Node 0 DMA free:2168kB min:204kB low:252kB high:300kB active_anon:3544kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2684kB kernel_stack:1760kB pagetables:3092kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 493 493 1955

Zone DMA is unusable.

> Node 0 DMA32 free:6508kB min:6492kB low:8112kB high:9732kB active_anon:81264kB inactive_anon:0kB active_file:101204kB inactive_file:228kB unevictable:0kB writepending:0kB present:2080632kB managed:508584kB mlocked:0kB slab_reclaimable:21904kB slab_unreclaimable:89876kB kernel_stack:46400kB pagetables:80616kB bounce:0kB free_pcp:544kB local_pcp:120kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 1462

Zone DMA32 has reclaimable pages but not very many, and they are active.
It's at the min watermark. The pgdat is unreclaimable, indicating that the
scan counts are high, which implies that the active file pages are due to
genuine activations.

> Node 0 Movable free:19116kB min:19256kB low:24068kB high:28880kB active_anon:22240kB inactive_anon:380kB active_file:68812kB inactive_file:1389688kB unevictable:0kB writepending:0kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:368kB local_pcp:0kB free_cma:0kB

Zone Movable has reclaimable pages but it's at the min watermark and
scanning aggressively.

As the failing allocation can use all zones, this appears to be close to
a genuine OOM case. Whether it survives is down to the timing of when OOM
is triggered and whether the forked process exits in time or not.

To some extent, it could be "addressed" by immediately reclaiming active
pages that move to the inactive list, at the cost of distorting page age
for a workload that is genuinely close to OOM. That is similar to what
zone-lru ended up doing -- fast reclaiming young pages from a zone.
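For reference, the "unreclaimable" determination discussed here is roughly
the following check (a sketch based on the 4.7-era zone_reclaimable() test
carried over to node-lru; the exact helper names and threshold in the tree
under review may differ):

	/*
	 * Sketch: a pgdat is considered unreclaimable once six times its
	 * reclaimable pages have been scanned since the last page was freed.
	 */
	static bool pgdat_reclaimable(struct pglist_data *pgdat)
	{
		return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
			pgdat_reclaimable_pages(pgdat) * 6;
	}

Against the report above, pages_scanned:4697206 versus roughly 390,000
reclaimable file pages comfortably exceeds that ratio, which is why
all_unreclaimable reads "yes".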
> > Optionally please test without patch 5 if an OOM occurs.
>
> Here goes without patch 5.
>

Causing OOM detection to be delayed. Observations on the OOM message
without patch 5 are similar.

Do you mind trying the following? In the patch there is the line

	scan += list_empty(src) ? total_skipped : total_skipped >> 2;

Try each of

	scan += list_empty(src) ? total_skipped : total_skipped >> 3;
	scan += list_empty(src) ? total_skipped : total_skipped >> 4;
	scan += total_skipped >> 4;

Each line slows the rate at which OOM is detected, but the effect will be
somewhat specific to your test case as it relies on fork exiting before
OOM is fired.

A hackier option, also related to the fact that fork is a major source of
the OOM triggering, is to increase the zone reserve. That would give more
space for the fork bomb while giving the file reader slightly less memory
to work with. Again, this simply alters OOM timing because the indications
are that the stress workload is genuinely close to OOM.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08ae8b0ef5c5..cedc8113c7a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -201,9 +201,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 	 256,
 #endif
 #ifdef CONFIG_HIGHMEM
-	 32,
+	 8,
 #endif
-	 32,
+	 8,
 };
 EXPORT_SYMBOL(totalram_pages);

-- 
Mel Gorman
SUSE Labs
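For scale, the reserve those ratio entries produce is, roughly, the managed
pages of the eligible higher zones divided by the ratio entry (see
setup_per_zone_lowmem_reserve() in mm/page_alloc.c). Rough arithmetic from
the report above, assuming that formula: DMA's lowmem_reserve[] entry of
1955 pages against Movable-capable allocations is about
(508584kB + 1500964kB) / 4kB / 256. Dropping a ratio entry from 32 to 8, as
in the diff above, therefore quadruples the corresponding reserve, holding
roughly four times as many lowmem pages back from allocations that could
have used a higher zone.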
On Tue, Jul 26, 2016 at 01:50:50PM +0100, Mel Gorman wrote:
> <SNIP>
>
> Zone Movable has reclaimable pages but it's at the min watermark and
> scanning aggressively.
>
> As the failing allocation can use all zones, this appears to be close to
> a genuine OOM case. Whether it survives is down to the timing of when OOM
> is triggered and whether the forked process exits in time or not.
>
> To some extent, it could be "addressed" by immediately reclaiming active
> pages that move to the inactive list, at the cost of distorting page age
> for a workload that is genuinely close to OOM. That is similar to what
> zone-lru ended up doing -- fast reclaiming young pages from a zone.

My expectation for my test case is that the reclaimers should kick out
the actively used pages and make room for 'fork', because the parallel
readers would still work even if the pages they are reading are not
cached.

It is sensitive to reclaimer efficiency because the parallel readers read
pages repeatedly and disturb reclaim. I thought it was a good test for
node-lru, which changes reclaimer efficiency for the lower zones. However,
as you said, this efficiency comes at the cost of distorting page aging,
so now I'm not sure it is a problem we need to consider. Let's skip it?

Anyway, thanks for tracking down the problem.

> > > Optionally please test without patch 5 if an OOM occurs.
> >
> > Here goes without patch 5.
> >
>
> Causing OOM detection to be delayed. Observations on the OOM message
> without patch 5 are similar.
>
> Do you mind trying the following? In the patch there is the line
>
> 	scan += list_empty(src) ? total_skipped : total_skipped >> 2;
>
> Try each of
>
> 	scan += list_empty(src) ? total_skipped : total_skipped >> 3;
> 	scan += list_empty(src) ? total_skipped : total_skipped >> 4;
> 	scan += total_skipped >> 4;

Tested, but none of the results show much difference.

> Each line slows the rate at which OOM is detected, but the effect will be
> somewhat specific to your test case as it relies on fork exiting before
> OOM is fired.

Okay. I don't think optimizing general code for my specific test case is
a good idea.

Thanks.
On Thu, Jul 28, 2016 at 03:44:33PM +0900, Joonsoo Kim wrote:
> > To some extent, it could be "addressed" by immediately reclaiming active
> > pages that move to the inactive list, at the cost of distorting page age
> > for a workload that is genuinely close to OOM. That is similar to what
> > zone-lru ended up doing -- fast reclaiming young pages from a zone.
>
> My expectation for my test case is that the reclaimers should kick out
> the actively used pages and make room for 'fork', because the parallel
> readers would still work even if the pages they are reading are not
> cached.
>
> It is sensitive to reclaimer efficiency because the parallel readers read
> pages repeatedly and disturb reclaim. I thought it was a good test for
> node-lru, which changes reclaimer efficiency for the lower zones. However,
> as you said, this efficiency comes at the cost of distorting page aging,
> so now I'm not sure it is a problem we need to consider. Let's skip it?
>

I think we should skip it for now. The alterations are too specific to a
test case that is very close to being genuinely OOM. Adjusting the timing
for one OOM case may just lead to complaints that OOM is detected too
slowly in others.

> Anyway, thanks for tracking down the problem.
>

My pleasure. Thanks to both you and Minchan for persisting with this, as
we got some important fixes out of the discussion.

-- 
Mel Gorman
SUSE Labs