* [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2
@ 2016-07-21 14:10 ` Mel Gorman
  0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-21 14:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

Both Joonsoo Kim and Minchan Kim have reported premature OOM kills.
The common element is zone-constrained allocations failing. Two factors
appear to be at fault -- the pgdat being considered unreclaimable
prematurely and insufficient rotation of the active list.

The series is in three basic parts:

Patches 1-3 add per-zone stats back in. The actual stats patch is different
	to Minchan's as the original patch did not account for the unevictable
	LRU, which would corrupt counters. The following two patches remove
	approximations based on pgdat statistics. It's effectively a
	revert of "mm, vmstat: remove zone and node double accounting
	by approximating retries" but different LRU stats are used. This
	is better than a full revert or a reworking of the series as it
	preserves the history of why the zone stats are necessary.

	If this works out, we may have to leave the double accounting in
	place for now until an alternative cheap solution presents itself.

Patch 4 rotates inactive/active lists for lowmem allocations. This is also
	quite different to Minchan's patch as the original patch did not
	account for memcg and would rotate if *any* eligible zone needed
	rotation, which may rotate excessively. The new patch considers the
	ratio for all eligible zones, which is more in line with node-lru
	in general.

Patch 5 accounts for skipped pages as partially scanned. This avoids the
	pgdat being prematurely marked unreclaimable while still allowing it
	to be marked unreclaimable if there are no reclaimable pages. A
	sketch of the ideas behind patches 4 and 5 follows below.
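
A minimal userspace sketch of the two ideas above (illustrative only:
the function names mirror the kernel's but the bodies are a model, the
zone sizes are the anon LRU pages from the OOM report in patch 1, and
the real kernel scales the inactive ratio with memory size rather than
the plain 1:1 comparison used here for brevity):

#include <stdbool.h>
#include <stdio.h>

#define NR_ZONES	3			/* DMA, Normal, HighMem */

struct zone_lru {
	unsigned long inactive;
	unsigned long active;
};

/* Patch 4 idea: only zones eligible for the allocation are considered */
static bool inactive_list_is_low(const struct zone_lru *zones,
				 int classzone_idx)
{
	unsigned long inactive = 0, active = 0;
	int zid;

	for (zid = 0; zid <= classzone_idx; zid++) {
		inactive += zones[zid].inactive;
		active += zones[zid].active;
	}

	/* rotate active pages when the eligible inactive list is smaller */
	return inactive < active;
}

/* Patch 5 idea: pages skipped as ineligible still count as scan progress */
static unsigned long account_scan(unsigned long nr_taken,
				  unsigned long nr_skipped)
{
	return nr_taken + nr_skipped;
}

int main(void)
{
	/* anon LRU pages per zone, taken from the OOM report in patch 1 */
	const struct zone_lru zones[NR_ZONES] = {
		{ .inactive = 0,      .active = 1385  },	/* DMA */
		{ .inactive = 0,      .active = 21576 },	/* Normal */
		{ .inactive = 102506, .active = 78213 },	/* HighMem */
	};

	/* a GFP_KERNEL allocation is limited to DMA+Normal: must rotate */
	printf("lowmem view rotates?    %s\n",
	       inactive_list_is_low(zones, 1) ? "yes" : "no");
	/* the node-wide view hides the skew and never rotates */
	printf("node-wide view rotates? %s\n",
	       inactive_list_is_low(zones, 2) ? "yes" : "no");
	/* without patch 5, 32 skipped pages would count as zero progress */
	printf("scan progress: %lu\n", account_scan(0, 32));
	return 0;
}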

These patches did not OOM for me on a 2G 32-bit KVM instance while running
a stress test for an hour. Preliminary tests on a 64-bit system using a
parallel dd workload did not show anything alarming.

If an OOM is detected then please post the full OOM message.

If an OOM occurs, please also test without patch 5 if possible.

 include/linux/mm_inline.h | 19 ++---------
 include/linux/mmzone.h    |  7 ++++
 include/linux/swap.h      |  1 +
 mm/compaction.c           | 20 +----------
 mm/migrate.c              |  2 ++
 mm/page-writeback.c       | 17 +++++-----
 mm/page_alloc.c           | 59 +++++++++++----------------------
 mm/vmscan.c               | 84 ++++++++++++++++++++++++++++++++++++++---------
 mm/vmstat.c               |  6 ++++
 9 files changed, 116 insertions(+), 99 deletions(-)

-- 
2.6.4

* [PATCH 1/5] mm: add per-zone lru list stat
  2016-07-21 14:10 ` Mel Gorman
@ 2016-07-21 14:10   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-21 14:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan@kernel.org>

While running a stress test with hackbench, I frequently got OOM messages
that never happened with zone-lru.

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
..
..
 [<c71a76e2>] __alloc_pages_nodemask+0xe52/0xe60
 [<c71f31dc>] ? new_slab+0x39c/0x3b0
 [<c71f31dc>] new_slab+0x39c/0x3b0
 [<c71f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c777e127>] ? _raw_spin_unlock_irq+0x27/0x60
 [<c70cebfc>] ? trace_hardirqs_on_caller+0xec/0x1b0
 [<c70a1506>] ? finish_task_switch+0xa6/0x220
 [<c7219ee0>] ? poll_select_copy_remaining+0x140/0x140
 [<c7201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c71f525c>] kmem_cache_alloc+0x22c/0x260
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c763e6fc>] __alloc_skb+0x3c/0x260
 [<c763eece>] alloc_skb_with_frags+0x4e/0x1a0
 [<c7638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
 [<c770b581>] ? wait_for_unix_gc+0x31/0x90
 [<c71cfb1d>] ? alloc_set_pte+0x2ad/0x310
 [<c77084dd>] unix_stream_sendmsg+0x28d/0x340
 [<c7634dad>] sock_sendmsg+0x2d/0x40
 [<c7634e2c>] sock_write_iter+0x6c/0xc0
 [<c7204a90>] __vfs_write+0xc0/0x120
 [<c72053ab>] vfs_write+0x9b/0x1a0
 [<c71cc4a9>] ? __might_fault+0x49/0xa0
 [<c72062c4>] SyS_write+0x44/0x90
 [<c70036c6>] do_fast_syscall_32+0xa6/0x1e0
 [<c777ea2c>] sysenter_past_esp+0x45/0x74

Mem-Info:
active_anon:104698 inactive_anon:105791 isolated_anon:192
 active_file:433 inactive_file:283 isolated_file:22
 unevictable:0 dirty:0 writeback:296 unstable:0
 slab_reclaimable:6389 slab_unreclaimable:78927
 mapped:474 shmem:0 pagetables:101426 bounce:0
 free:10518 free_pcp:334 free_cma:0
Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
25121 total pagecache pages
24160 pages in swap cache
Swap cache stats: add 86371, delete 62211, find 42865/60187
Free swap  = 4015560kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9658 pages reserved
0 pages cma reserved

The order-0 allocation for the normal zone failed while there was a lot of
reclaimable memory (i.e., anonymous memory with free swap). I wanted to
analyze the problem but it was hard because the per-zone lru stats had been
removed, so I could not tell how much anonymous memory was in the normal/dma
zones.

When investigating an OOM problem, the reclaimable memory count is a
crucial stat for finding the cause. Without it, the OOM message is hard
to parse, so I believe we should keep it.

With per-zone lru stats:

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
Mem-Info:
active_anon:101103 inactive_anon:102219 isolated_anon:0
 active_file:503 inactive_file:544 isolated_file:0
 unevictable:0 dirty:0 writeback:34 unstable:0
 slab_reclaimable:6298 slab_unreclaimable:74669
 mapped:863 shmem:0 pagetables:100998 bounce:0
 free:23573 free_pcp:1861 free_cma:0
Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
54409 total pagecache pages
53215 pages in swap cache
Swap cache stats: add 300982, delete 247765, find 157978/226539
Free swap  = 3803244kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9642 pages reserved
0 pages cma reserved

With that, we can see the normal zone has about 86M of reclaimable memory,
so we know something is going wrong in reclaim (I will fix the problem in
the next patch).
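
For example, summing the new per-zone counters for the Normal zone above:
active_anon 86304kB + inactive_anon 0kB + active_file 160kB +
inactive_file 376kB = 86840kB (~86M) of reclaimable memory, while
free:3600kB sits just below the min:3604kB watermark.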

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h |  2 ++
 include/linux/mmzone.h    |  6 ++++++
 mm/page_alloc.c           | 10 ++++++++++
 mm/vmscan.c               |  9 ---------
 mm/vmstat.c               |  5 +++++
 5 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index bcc4ed07fa90..9cc130f5feb2 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -45,6 +45,8 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
+	__mod_zone_page_state(&pgdat->node_zones[zid],
+				NR_ZONE_LRU_BASE + lru, nr_pages);
 	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e6aca07cedb7..72625b04e9ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,6 +110,12 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
+	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
+	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
+	NR_ZONE_ACTIVE_ANON,
+	NR_ZONE_INACTIVE_FILE,
+	NR_ZONE_ACTIVE_FILE,
+	NR_ZONE_UNEVICTABLE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 830ad49a584a..b44c9a8d879a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4388,6 +4388,11 @@ void show_free_areas(unsigned int filter)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
+			" unevictable:%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4405,6 +4410,11 @@ void show_free_areas(unsigned int filter)
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
+			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 22aec2bcfeec..222d5403dd4b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,23 +1359,14 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 			enum lru_list lru, unsigned long *nr_zone_taken,
 			unsigned long nr_taken)
 {
-#ifdef CONFIG_HIGHMEM
 	int zid;
 
-	/*
-	 * Highmem has separate accounting for highmem pages so each zone
-	 * is updated separately.
-	 */
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		if (!nr_zone_taken[zid])
 			continue;
 
 		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
 	}
-#else
-	/* Zone ID does not matter on !HIGHMEM */
-	__update_lru_size(lruvec, lru, 0, -nr_taken);
-#endif
 
 #ifdef CONFIG_MEMCG
 	mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 91ecca96dcae..f10aad81a9a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
+	"nr_unevictable",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4

* [PATCH 2/5] mm, vmscan: Remove highmem_file_pages
  2016-07-21 14:10 ` Mel Gorman
@ 2016-07-21 14:10   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-21 14:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

With the reintroduction of per-zone LRU stats, highmem_file_pages is
redundant, so remove it.
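
The highmem dirtyable estimate is instead computed directly from the
per-zone file LRU counters added in patch 1. Condensed from the diff
below (the surrounding per-node/per-zone loop is omitted):

	nr_pages = zone_page_state(z, NR_FREE_PAGES);
	/* watch for underflows */
	nr_pages -= min(nr_pages, high_wmark_pages(z));
	nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
	nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
	x += nr_pages;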

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h | 17 -----------------
 mm/page-writeback.c       | 12 ++++--------
 2 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9cc130f5feb2..71613e8a720f 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,22 +4,6 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
 
-#ifdef CONFIG_HIGHMEM
-extern atomic_t highmem_file_pages;
-
-static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
-							int nr_pages)
-{
-	if (is_highmem_idx(zid) && is_file_lru(lru))
-		atomic_add(nr_pages, &highmem_file_pages);
-}
-#else
-static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
-							int nr_pages)
-{
-}
-#endif
-
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -47,7 +31,6 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
-	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 573d138fa7a5..cfa78124c3c2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,17 +299,13 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	return nr_pages;
 }
-#ifdef CONFIG_HIGHMEM
-atomic_t highmem_file_pages;
-#endif
 
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
 	int node;
-	unsigned long x;
+	unsigned long x = 0;
 	int i;
-	unsigned long dirtyable = 0;
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
 		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
@@ -326,12 +322,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 			nr_pages = zone_page_state(z, NR_FREE_PAGES);
 			/* watch for underflows */
 			nr_pages -= min(nr_pages, high_wmark_pages(z));
-			dirtyable += nr_pages;
+			nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
+			x += nr_pages;
 		}
 	}
 
-	x = dirtyable + atomic_read(&highmem_file_pages);
-
 	/*
 	 * Unreclaimable memory (kernel memory or anonymous memory
 	 * without swap) can bring down the dirtyable pages below
-- 
2.6.4

* [PATCH 3/5] mm: Remove reclaim and compaction retry approximations
  2016-07-21 14:10 ` Mel Gorman
@ 2016-07-21 14:10   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-21 14:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics. This is effectively a revert of "mm, vmstat: remove zone and
node double accounting by approximating retries" with the difference that
inactive/active stats are still available. This preserves the history of
why the approximation was attempted and why it had to be reverted to handle
OOM kills on 32-bit systems.
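
Condensed from the should_reclaim_retry() hunk below, the retry decision
now works against a single zone's own counters:

	available = reclaimable = zone_reclaimable_pages(zone);
	available -= DIV_ROUND_UP(no_progress_loops * available,
				  MAX_RECLAIM_RETRIES);
	available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

	/* would the allocation succeed if "available" was reclaimed? */
	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
			ac_classzone_idx(ac), alloc_flags, available))
		/* retry, or congestion_wait() if writes dominate */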

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  1 +
 include/linux/swap.h   |  1 +
 mm/compaction.c        | 20 +-------------------
 mm/migrate.c           |  2 ++
 mm/page-writeback.c    |  5 +++++
 mm/page_alloc.c        | 49 ++++++++++---------------------------------------
 mm/vmscan.c            | 18 ++++++++++++++++++
 mm/vmstat.c            |  1 +
 8 files changed, 39 insertions(+), 58 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 72625b04e9ba..f2e4e90621ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index cc753c639e3d..b17cc4830fa6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,6 +307,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index cd93ea24c565..e5995f38d677 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1438,11 +1438,6 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
 	struct zone *zone;
 	struct zoneref *z;
-	pg_data_t *last_pgdat = NULL;
-
-	/* Do not retry compaction for zone-constrained allocations */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		return false;
 
 	/*
 	 * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1453,27 +1448,14 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		unsigned long available;
 		enum compact_result compact_result;
 
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-
-		/*
-		 * This over-estimates the number of pages available for
-		 * reclaim/compaction but walking the LRU would take too
-		 * long. The consequences are that compaction may retry
-		 * longer than it should for a zone-constrained allocation
-		 * request.
-		 */
-		last_pgdat = zone->zone_pgdat;
-		available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
-
 		/*
 		 * Do not consider all the reclaimable memory because we do not
 		 * want to trash just for a single high order allocation which
 		 * is even not guaranteed to appear even if __compaction_suitable
 		 * is happy about the watermark check.
 		 */
+		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
-		available = min(zone->managed_pages, available);
 		compact_result = __compaction_suitable(zone, order, alloc_flags,
 				ac_classzone_idx(ac), available);
 		if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index ed2f85e61de1..ed0268268e93 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,7 +513,9 @@ int migrate_page_move_mapping(struct address_space *mapping,
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
 			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
 			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+			__inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cfa78124c3c2..7e9061ec040b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2462,6 +2462,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
+		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2483,6 +2484,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		dec_node_page_state(page, NR_FILE_DIRTY);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2739,6 +2741,7 @@ int clear_page_dirty_for_io(struct page *page)
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 			dec_node_page_state(page, NR_FILE_DIRTY);
+			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2785,6 +2788,7 @@ int test_clear_page_writeback(struct page *page)
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2839,6 +2843,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		inc_node_page_state(page, NR_WRITEBACK);
+		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b44c9a8d879a..afb254e22235 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3434,7 +3434,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
-	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3444,15 +3443,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
-	 * Blindly retry lowmem allocation requests that are often ignored by
-	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
-	 * and fast means of calculating reclaimable, dirty and writeback pages
-	 * in eligible zones.
-	 */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		goto out;
-
-	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3462,38 +3452,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
-		int zid;
 
-		if (current_pgdat == zone->zone_pgdat)
-			continue;
-
-		current_pgdat = zone->zone_pgdat;
-		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
+		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-
-		/* Account for all free pages on eligible zones */
-		for (zid = 0; zid <= zone_idx(zone); zid++) {
-			struct zone *acct_zone = &current_pgdat->node_zones[zid];
-
-			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
-		}
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
-		 * available? This is approximate because there is no
-		 * accurate count of reclaimable pages per zone.
+		 * available?
 		 */
-		for (zid = 0; zid <= zone_idx(zone); zid++) {
-			struct zone *check_zone = &current_pgdat->node_zones[zid];
-			unsigned long estimate;
-
-			estimate = min(check_zone->managed_pages, available);
-			if (!__zone_watermark_ok(check_zone, order,
-					min_wmark_pages(check_zone), ac_classzone_idx(ac),
-					alloc_flags, estimate))
-				continue;
-
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac_classzone_idx(ac), alloc_flags, available)) {
 			/*
 			 * If we didn't make any progress and have a lot of
 			 * dirty + writeback pages then we should wait for
@@ -3503,16 +3473,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			if (!did_some_progress) {
 				unsigned long write_pending;
 
-				write_pending =
-					node_page_state(current_pgdat, NR_WRITEBACK) +
-					node_page_state(current_pgdat, NR_FILE_DIRTY);
+				write_pending = zone_page_state_snapshot(zone,
+							NR_ZONE_WRITE_PENDING);
 
 				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
 			}
-out:
+
 			/*
 			 * Memory allocation/reclaim might be called from a WQ
 			 * context and the current implementation of the WQ
@@ -4393,6 +4362,7 @@ void show_free_areas(unsigned int filter)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
+			" writepending:%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4415,6 +4385,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
+			K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 222d5403dd4b..134381a20099 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,6 +194,24 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
+/*
+ * This misses isolated pages which are not accounted for to save counters.
+ * As the data only determines if reclaim or compaction continues, it is
+ * not expected that isolated pages will be a dominating factor.
+ */
+unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	unsigned long nr;
+
+	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
+		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
+			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
+
+	return nr;
+}
+
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f10aad81a9a3..e1a46906c61b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,7 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
+	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4

* [PATCH 4/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-21 14:10 ` Mel Gorman
@ 2016-07-21 14:11   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-21 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan@kernel.org>

Minchan Kim reported that with per-zone lru state it was possible to
identify that a normal zone with ~86M of anonymous pages could trigger
OOM with non-atomic order-0 allocations as all pages in the zone
were in the active list.

   gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
   Call Trace:
    [<c51a76e2>] __alloc_pages_nodemask+0xe52/0xe60
    [<c51f31dc>] ? new_slab+0x39c/0x3b0
    [<c51f31dc>] new_slab+0x39c/0x3b0
    [<c51f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
    [<c563e6fc>] ? __alloc_skb+0x3c/0x260
    [<c50b8e93>] ? enqueue_task_fair+0x73/0xbf0
    [<c5219ee0>] ? poll_select_copy_remaining+0x140/0x140
    [<c5201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
    [<c563e6fc>] ? __alloc_skb+0x3c/0x260
    [<c51f525c>] kmem_cache_alloc+0x22c/0x260
    [<c563e6fc>] ? __alloc_skb+0x3c/0x260
    [<c563e6fc>] __alloc_skb+0x3c/0x260
    [<c563eece>] alloc_skb_with_frags+0x4e/0x1a0
    [<c5638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
    [<c570b581>] ? wait_for_unix_gc+0x31/0x90
    [<c57084dd>] unix_stream_sendmsg+0x28d/0x340
    [<c5634dad>] sock_sendmsg+0x2d/0x40
    [<c5634e2c>] sock_write_iter+0x6c/0xc0
    [<c5204a90>] __vfs_write+0xc0/0x120
    [<c52053ab>] vfs_write+0x9b/0x1a0
    [<c51cc4a9>] ? __might_fault+0x49/0xa0
    [<c52062c4>] SyS_write+0x44/0x90
    [<c50036c6>] do_fast_syscall_32+0xa6/0x1e0

   Mem-Info:
   active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
   Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
   DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
   lowmem_reserve[]: 0 809 1965 1965
   Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
   lowmem_reserve[]: 0 0 9247 9247
   HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
   lowmem_reserve[]: 0 0 0 0
   DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
   Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
   HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
   54409 total pagecache pages
   53215 pages in swap cache
   Swap cache stats: add 300982, delete 247765, find 157978/226539
   Free swap  = 3803244kB
   Total swap = 4192252kB
   524186 pages RAM
   295934 pages HighMem/MovableOnly
   9642 pages reserved
   0 pages cma reserved

The problem is due to the active list deactivation logic in
inactive_list_is_low.

	Node 0 active_anon:404412kB inactive_anon:409040kB

IOW, (inactive_anon of the node * inactive_ratio > active_anon of the
node) because of the highmem anonymous stats, so the VM never
deactivates the normal zone's anonymous pages.

This patch is a modified version of Minchan's original solution. The
problem with Minchan's original patch was that it did not take memcg
into account, and any low zone with an imbalanced list could force a
rotation.

In this patch, a zone-constrained global reclaim will rotate the list if
the inactive/active ratio of all eligible zones needs to be corrected. It
is possible that higher zone pages will be initially rotated prematurely
but this is the safer choice to maintain overall LRU age.
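
Plugging the numbers from the Mem-Info dump above into the check makes
the failure mode concrete (rough figures, ratio maths as in
inactive_list_is_low):

	node:   inactive_anon 409040kB, active_anon 404412kB
	        total < 1GB               -> inactive_ratio = 1
	        inactive * 1 >= active    -> "not low", never deactivate

	Normal: inactive_anon 0kB, active_anon 86304kB
	        -> lowmem anon is never aged despite being entirely active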

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 134381a20099..6810d81f60c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1964,7 +1964,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
+static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
+						struct scan_control *sc)
 {
 	unsigned long inactive_ratio;
 	unsigned long inactive;
@@ -1981,6 +1982,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
 	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
 	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
+	/*
+	 * For global reclaim on zone-constrained allocations, it is necessary
+	 * to check if rotations are required for lowmem to be reclaimed. This
+	 * calculates the inactive/active pages available in eligible zones.
+	 */
+	if (global_reclaim(sc)) {
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+		int zid;
+
+		for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
+			struct zone *zone = &pgdat->node_zones[zid];
+			unsigned long inactive_zone, active_zone;
+
+			if (!populated_zone(zone))
+				continue;
+
+			inactive_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE));
+			active_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
+
+			inactive -= min(inactive, inactive_zone);
+			active -= min(active, active_zone);
+		}
+	}
+
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
@@ -1994,7 +2021,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, is_file_lru(lru)))
+		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -2125,7 +2152,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * lruvec even if it has plenty of old anonymous pages unless the
 	 * system is under heavy pressure.
 	 */
-	if (!inactive_list_is_low(lruvec, true) &&
+	if (!inactive_list_is_low(lruvec, true, sc) &&
 	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
@@ -2367,7 +2394,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_list_is_low(lruvec, false))
+	if (inactive_list_is_low(lruvec, false, sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 
@@ -3020,7 +3047,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
-		if (inactive_list_is_low(lruvec, false))
+		if (inactive_list_is_low(lruvec, false, sc))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 					   sc, LRU_ACTIVE_ANON);
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan
  2016-07-21 14:10 ` Mel Gorman
@ 2016-07-21 14:11   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-21 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

Page reclaim determines whether a pgdat is unreclaimable by examining how
many pages have been scanned since a page was freed and comparing that to
the LRU sizes. Skipped pages are not reclaim candidates but contribute to
scanned. This can prematurely mark a pgdat as unreclaimable and trigger
an OOM kill.

This patch accounts for skipped pages as a partial scan so that an
unreclaimable pgdat will still be marked as such, but by scaling the
cost of a skip it avoids the pgdat being marked prematurely.
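
Concretely, in the hunk below each skipped page is charged as a quarter
of a scan and only counts in full once the LRU list has been emptied:

	/* quarter-weight skips, full weight on an empty LRU */
	scan += list_empty(src) ? total_skipped : total_skipped >> 2;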

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6810d81f60c7..e5af357dd4ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
-					!list_empty(src); scan++) {
+					!list_empty(src);) {
 		struct page *page;
 
 		page = lru_to_page(src);
@@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			continue;
 		}
 
+		/*
+		 * Account for scanned and skipped separately to avoid the pgdat
+		 * being prematurely marked unreclaimable by pgdat_reclaimable.
+		 */
+		scan++;
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	 */
 	if (!list_empty(&pages_skipped)) {
 		int zid;
+		unsigned long total_skipped = 0;
 
-		list_splice(&pages_skipped, src);
 		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 			if (!nr_skipped[zid])
 				continue;
 
 			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
+			total_skipped += nr_skipped[zid];
 		}
+
+		/*
+		 * Account skipped pages as a partial scan as the pgdat may be
+		 * close to unreclaimable. If the LRU list is empty, account
+		 * skipped pages as a full scan.
+		 */
+		scan += list_empty(src) ? total_skipped : total_skipped >> 2;
+
+		list_splice(&pages_skipped, src);
 	}
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-21 14:11   ` Mel Gorman
@ 2016-07-21 15:52     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2016-07-21 15:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:11:00PM +0100, Mel Gorman wrote:
> @@ -1981,6 +1982,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
>  	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
>  	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
>  
> +	/*
> +	 * For global reclaim on zone-constrained allocations, it is necessary
> +	 * to check if rotations are required for lowmem to be reclaimed. This

s/rotation/deactivation/

> +	 * calculates the inactive/active pages available in eligible zones.
> +	 */
> +	if (global_reclaim(sc)) {
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +		int zid;
> +
> +		for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {

The emphasis on global vs. memcg reclaim is somewhat strange, because
this is only about excluding from the balancing math the pages that
will be skipped. Memcg reclaim is never zone-restricted, but if it
were, it would make sense to exclude the skipped pages there as well.

Indeed, for memcg reclaim sc->reclaim_idx+1 is always MAX_NR_ZONES,
and so the for loop alone will do the right thing.
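
That is, without the branch the loop degenerates into a no-op for memcg
reclaim anyway (sketch):

	/* memcg reclaim: sc->reclaim_idx + 1 == MAX_NR_ZONES */
	for (zid = MAX_NR_ZONES; zid < MAX_NR_ZONES; zid++)
		;	/* never executes, nothing is excluded */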

Can you please drop the global_reclaim() branch, the sc function
parameter, and the "global reclaim" from the comment?

Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/5] mm: add per-zone lru list stat
  2016-07-21 14:10   ` Mel Gorman
@ 2016-07-22 15:51     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2016-07-22 15:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:10:57PM +0100, Mel Gorman wrote:
> From: Minchan Kim <minchan@kernel.org>
> 
> While running a stress test with hackbench, I got OOM messages
> frequently, which never happened with zone-lru.
> 
> gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
> ..
> ..
>  [<c71a76e2>] __alloc_pages_nodemask+0xe52/0xe60
>  [<c71f31dc>] ? new_slab+0x39c/0x3b0
>  [<c71f31dc>] new_slab+0x39c/0x3b0
>  [<c71f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
>  [<c763e6fc>] ? __alloc_skb+0x3c/0x260
>  [<c777e127>] ? _raw_spin_unlock_irq+0x27/0x60
>  [<c70cebfc>] ? trace_hardirqs_on_caller+0xec/0x1b0
>  [<c70a1506>] ? finish_task_switch+0xa6/0x220
>  [<c7219ee0>] ? poll_select_copy_remaining+0x140/0x140
>  [<c7201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
>  [<c763e6fc>] ? __alloc_skb+0x3c/0x260
>  [<c71f525c>] kmem_cache_alloc+0x22c/0x260
>  [<c763e6fc>] ? __alloc_skb+0x3c/0x260
>  [<c763e6fc>] __alloc_skb+0x3c/0x260
>  [<c763eece>] alloc_skb_with_frags+0x4e/0x1a0
>  [<c7638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
>  [<c770b581>] ? wait_for_unix_gc+0x31/0x90
>  [<c71cfb1d>] ? alloc_set_pte+0x2ad/0x310
>  [<c77084dd>] unix_stream_sendmsg+0x28d/0x340
>  [<c7634dad>] sock_sendmsg+0x2d/0x40
>  [<c7634e2c>] sock_write_iter+0x6c/0xc0
>  [<c7204a90>] __vfs_write+0xc0/0x120
>  [<c72053ab>] vfs_write+0x9b/0x1a0
>  [<c71cc4a9>] ? __might_fault+0x49/0xa0
>  [<c72062c4>] SyS_write+0x44/0x90
>  [<c70036c6>] do_fast_syscall_32+0xa6/0x1e0
>  [<c777ea2c>] sysenter_past_esp+0x45/0x74
> 
> Mem-Info:
> active_anon:104698 inactive_anon:105791 isolated_anon:192
>  active_file:433 inactive_file:283 isolated_file:22
>  unevictable:0 dirty:0 writeback:296 unstable:0
>  slab_reclaimable:6389 slab_unreclaimable:78927
>  mapped:474 shmem:0 pagetables:101426 bounce:0
>  free:10518 free_pcp:334 free_cma:0
> Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
> DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 809 1965 1965
> Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
> lowmem_reserve[]: 0 0 9247 9247
> HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
> Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
> HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 25121 total pagecache pages
> 24160 pages in swap cache
> Swap cache stats: add 86371, delete 62211, find 42865/60187
> Free swap  = 4015560kB
> Total swap = 4192252kB
> 524186 pages RAM
> 295934 pages HighMem/MovableOnly
> 9658 pages reserved
> 0 pages cma reserved
> 
> The order-0 allocation for the normal zone failed while there was a lot
> of reclaimable memory (i.e., anonymous memory with free swap). I wanted
> to analyze the problem, but it was hard because we had removed the
> per-zone lru stats, so I couldn't tell how much anonymous memory there
> was in the normal/dma zones.
> 
> When we investigate an OOM problem, the reclaimable memory count is a
> crucial stat for finding the problem. Without it, it's hard to parse
> the OOM message, so I believe we should keep it.
> 
> With per-zone lru stat,
> 
> gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
> Mem-Info:
> active_anon:101103 inactive_anon:102219 isolated_anon:0
>  active_file:503 inactive_file:544 isolated_file:0
>  unevictable:0 dirty:0 writeback:34 unstable:0
>  slab_reclaimable:6298 slab_unreclaimable:74669
>  mapped:863 shmem:0 pagetables:100998 bounce:0
>  free:23573 free_pcp:1861 free_cma:0
> Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
> DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 809 1965 1965
> Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
> lowmem_reserve[]: 0 0 9247 9247
> HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
> Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
> HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 54409 total pagecache pages
> 53215 pages in swap cache
> Swap cache stats: add 300982, delete 247765, find 157978/226539
> Free swap  = 3803244kB
> Total swap = 4192252kB
> 524186 pages RAM
> 295934 pages HighMem/MovableOnly
> 9642 pages reserved
> 0 pages cma reserved
> 
> With that, we can see the normal zone has 86M of reclaimable memory, so
> we know something goes wrong in reclaim (I will fix the problem in the
> next patch).
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Yep, makes sense to retain that insight into zones.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/5] mm, vmscan: Remove highmem_file_pages
  2016-07-21 14:10   ` Mel Gorman
@ 2016-07-22 15:53     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2016-07-22 15:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:10:58PM +0100, Mel Gorman wrote:
> With the reintroduction of per-zone LRU stats, highmem_file_pages is
> redundant so remove it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/5] mm: Remove reclaim and compaction retry approximations
  2016-07-21 14:10   ` Mel Gorman
@ 2016-07-22 15:57     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2016-07-22 15:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:10:59PM +0100, Mel Gorman wrote:
> If per-zone LRU accounting is available then there is no point
> approximating whether reclaim and compaction should retry based on pgdat
> statistics. This is effectively a revert of "mm, vmstat: remove zone and
> node double accounting by approximating retries" with the difference that
> inactive/active stats are still available. This preserves the history of
> why the approximation was tried and why it had to be reverted to handle
> OOM kills on 32-bit systems.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

I like this version of should_reclaim_retry() much better ;)

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan
  2016-07-21 14:11   ` Mel Gorman
@ 2016-07-22 16:02     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2016-07-22 16:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that to
> the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> an OOM kill.
> 
> This patch accounts for skipped pages as a partial scan so that an
> unreclaimable pgdat will still be marked as such, but by scaling the
> cost of a skip it avoids the pgdat being marked prematurely.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/5] mm, vmscan: Remove highmem_file_pages
  2016-07-21 14:10   ` Mel Gorman
@ 2016-07-25  8:09     ` Minchan Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2016-07-25  8:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:10:58PM +0100, Mel Gorman wrote:
> With the reintroduction of per-zone LRU stats, highmem_file_pages is
> redundant so remove it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mm_inline.h | 17 -----------------
>  mm/page-writeback.c       | 12 ++++--------
>  2 files changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 9cc130f5feb2..71613e8a720f 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -4,22 +4,6 @@
>  #include <linux/huge_mm.h>
>  #include <linux/swap.h>
>  
> -#ifdef CONFIG_HIGHMEM
> -extern atomic_t highmem_file_pages;
> -
> -static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> -							int nr_pages)
> -{
> -	if (is_highmem_idx(zid) && is_file_lru(lru))
> -		atomic_add(nr_pages, &highmem_file_pages);
> -}
> -#else
> -static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> -							int nr_pages)
> -{
> -}
> -#endif
> -
>  /**
>   * page_is_file_cache - should the page be on a file LRU or anon LRU?
>   * @page: the page to test
> @@ -47,7 +31,6 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
>  	__mod_zone_page_state(&pgdat->node_zones[zid],
>  				NR_ZONE_LRU_BASE + lru, nr_pages);
> -	acct_highmem_file_pages(zid, lru, nr_pages);
>  }
>  
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 573d138fa7a5..cfa78124c3c2 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -299,17 +299,13 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
>  
>  	return nr_pages;
>  }
> -#ifdef CONFIG_HIGHMEM
> -atomic_t highmem_file_pages;
> -#endif
>  
>  static unsigned long highmem_dirtyable_memory(unsigned long total)
>  {
>  #ifdef CONFIG_HIGHMEM
>  	int node;
> -	unsigned long x;
> +	unsigned long x = 0;
>  	int i;
> -	unsigned long dirtyable = 0;
>  
>  	for_each_node_state(node, N_HIGH_MEMORY) {
>  		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
> @@ -326,12 +322,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  			nr_pages = zone_page_state(z, NR_FREE_PAGES);
>  			/* watch for underflows */
>  			nr_pages -= min(nr_pages, high_wmark_pages(z));
> -			dirtyable += nr_pages;
> +			nr_pages += zone_page_state(z, NR_INACTIVE_FILE);

                                                       NR_ZONE_INACTIVE_FILE

> +			nr_pages += zone_page_state(z, NR_ACTIVE_FILE);

                                                       NR_ZONE_ACTIVE_FILE

> +			x += nr_pages;
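
IOW, presumably the hunk is meant to read:

	nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
	nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
	x += nr_pages;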

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/5] mm: Remove reclaim and compaction retry approximations
  2016-07-21 14:10   ` Mel Gorman
@ 2016-07-25  8:18     ` Minchan Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2016-07-25  8:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:10:59PM +0100, Mel Gorman wrote:
> If per-zone LRU accounting is available then there is no point
> approximating whether reclaim and compaction should retry based on pgdat
> statistics. This is effectively a revert of "mm, vmstat: remove zone and
> node double accounting by approximating retries" with the difference that
> inactive/active stats are still available. This preserves the history of
> why the approximation was tried and why it had to be reverted to handle
> OOM kills on 32-bit systems.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan
  2016-07-21 14:11   ` Mel Gorman
@ 2016-07-25  8:39     ` Minchan Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2016-07-25  8:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that to
> the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> an OOM kill.
> 
> This patch accounts for skipped pages as a partial scan so that an
> unreclaimable pgdat will still be marked as such, but by scaling the
> cost of a skip it avoids the pgdat being marked prematurely.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6810d81f60c7..e5af357dd4ac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	LIST_HEAD(pages_skipped);
>  
>  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> -					!list_empty(src); scan++) {
> +					!list_empty(src);) {
>  		struct page *page;
>  
>  		page = lru_to_page(src);
> @@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>  
> +		/*
> +		 * Account for scanned and skipped separately to avoid the pgdat
> +		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> +		 */
> +		scan++;
> +
>  		switch (__isolate_lru_page(page, mode)) {
>  		case 0:
>  			nr_pages = hpage_nr_pages(page);
> @@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	 */
>  	if (!list_empty(&pages_skipped)) {
>  		int zid;
> +		unsigned long total_skipped = 0;
>  
> -		list_splice(&pages_skipped, src);
>  		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  			if (!nr_skipped[zid])
>  				continue;
>  
>  			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
> +			total_skipped += nr_skipped[zid];
>  		}
> +
> +		/*
> +		 * Account skipped pages as a partial scan as the pgdat may be
> +		 * close to unreclaimable. If the LRU list is empty, account
> +		 * skipped pages as a full scan.
> +		 */

node-lru made OOM detection lengthy because freeing pages into any zone
will easily reset NR_PAGES_SCANNED, so it's hard to reach a situation
where pgdat_reclaimable returns *false*.
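
For reference, the check in question looks roughly like this (a
simplified sketch of the mm/vmscan.c code of this era, not necessarily
the exact tree state):

	bool pgdat_reclaimable(struct pglist_data *pgdat)
	{
		/* "Reclaimable" while scans stay below 6x the LRU size */
		return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
			pgdat_reclaimable_pages(pgdat) * 6;
	}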

When I performed a stress test, it seemed I encountered the situation
easily, although I have no numbers right now.

Anyway, this patch makes sense to me because it's better than the
current behaviour. As for accounting the scan, I support this idea.

But still, I doubt it's okay to keep skipping pages under an
irq-disabled spinlock without any bound.
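
One illustrative way to bound it (a sketch only; nr_skip_total would be
a new local counter and the SWAP_CLUSTER_MAX << 3 cap is an arbitrary,
untested choice) would be to give up on isolation once too many
ineligible pages have been skipped:

	if (page_zonenum(page) > sc->reclaim_idx) {
		list_move(&page->lru, &pages_skipped);
		nr_skipped[page_zonenum(page)]++;
		/* Bound the time spent skipping under the LRU lock */
		if (++nr_skip_total > SWAP_CLUSTER_MAX << 3)
			break;
		continue;
	}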

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH] mm, vmscan: remove highmem_file_pages -fix
  2016-07-25  8:09     ` Minchan Kim
@ 2016-07-25  9:23       ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-25  9:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

The wrong stat is being accumulated in highmem_dirtyable_memory; fix it.

This is a fix to the mmotm patch mm-vmscan-remove-highmem_file_pages.patch

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page-writeback.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9061ec040b..f4cd7d8005c9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -322,8 +322,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 			nr_pages = zone_page_state(z, NR_FREE_PAGES);
 			/* watch for underflows */
 			nr_pages -= min(nr_pages, high_wmark_pages(z));
-			nr_pages += zone_page_state(z, NR_INACTIVE_FILE);
-			nr_pages += zone_page_state(z, NR_ACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
 			x += nr_pages;
 		}
 	}

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan
  2016-07-25  8:39     ` Minchan Kim
@ 2016-07-25  9:52       ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-25  9:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Mon, Jul 25, 2016 at 05:39:13PM +0900, Minchan Kim wrote:
> > @@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  	 */
> >  	if (!list_empty(&pages_skipped)) {
> >  		int zid;
> > +		unsigned long total_skipped = 0;
> >  
> > -		list_splice(&pages_skipped, src);
> >  		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> >  			if (!nr_skipped[zid])
> >  				continue;
> >  
> >  			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
> > +			total_skipped += nr_skipped[zid];
> >  		}
> > +
> > +		/*
> > +		 * Account skipped pages as a partial scan as the pgdat may be
> > +		 * close to unreclaimable. If the LRU list is empty, account
> > +		 * skipped pages as a full scan.
> > +		 */
> 
> node-lru made OOM detection lengthy because freeing pages into any zone
> will easily reset NR_PAGES_SCANNED, so it's hard to reach a situation
> where pgdat_reclaimable returns *false*.
> 

Your patch should go a long way towards addressing that, as it checks the
zone counters before conducting the scan. Remember as well that the longer
OOM detection only applies to zone-constrained allocations, and there is
always the possibility that shrinking highmem pages frees lowmem memory
if buffer heads are attached.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2
  2016-07-21 14:10 ` Mel Gorman
@ 2016-07-26  8:11   ` Joonsoo Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Joonsoo Kim @ 2016-07-26  8:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:10:56PM +0100, Mel Gorman wrote:
> Both Joonsoo Kim and Minchan Kim have reported premature OOM kills.
> The common element is a zone-constrained allocation failings. Two factors
> appear to be at fault -- pgdat being considered unreclaimable prematurely
> and insufficient rotation of the active list.
> 
> The series is in three basic parts;
> 
> Patches 1-3 add per-zone stats back in. The actual stats patch is different
> 	to Minchan's as the original patch did not account for unevictable
> 	LRU which would corrupt counters. The second two patches remove
> 	approximations based on pgdat statistics. It's effectively a
> 	revert of "mm, vmstat: remove zone and node double accounting
> 	by approximating retries" but different LRU stats are used. This
> 	is better than a full revert or a reworking of the series as it
> 	preserves history of why the zone stats are necessary.
> 
> 	If this work out, we may have to leave the double accounting in
> 	place for now until an alternative cheap solution presents itself.
> 
> Patch 4 rotates inactive/active lists for lowmem allocations. This is also
> 	quite different to Minchan's patch as the original patch did not
> 	account for memcg and would rotate if *any* eligible zone needed
> 	rotation which may rotate excessively. The new patch considers the
> 	ratio for all eligible zones which is more in line with node-lru
> 	in general.
> 
> Patch 5 accounts for skipped pages as partial scanned. This avoids the pgdat
> 	being prematurely marked unreclaimable while still allowing it to
> 	be marked unreclaimable if there are no reclaimable pages.
> 
> These patches did not OOM for me on a 2G 32-bit KVM instance while running
> a stress test for an hour. Preliminary tests on a 64-bit system using a
> parallel dd workload did not show anything alarming.
> 
> If an OOM is detected then please post the full OOM message.

Before attaching the OOM message, I should note that my test case also
triggers an OOM on the old kernel if there are four parallel file readers.
With node-lru and patches 1~5, an OOM is triggered even with a single
parallel file reader. With node-lru and patches 1~4, an OOM is triggered
with two or more parallel file readers.

Here is the OOM message.

fork invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
fork cpuset=/ mems_allowed=0
CPU: 0 PID: 4304 Comm: fork Not tainted 4.7.0-rc7-next-20160720+ #713
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff8800209ab960 ffffffff8142bd03 ffff8800209abb58
 ffff8800209a0000 ffff8800209ab9d8 ffffffff81241a59 ffffffff81e70020
 ffff8800209ab988 ffffffff810dddcd ffff8800209ab9a8 0000000000000206
Call Trace:
 [<ffffffff8142bd03>] dump_stack+0x85/0xc2
 [<ffffffff81241a59>] dump_header+0x5c/0x22e
 [<ffffffff810dddcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811b33e1>] oom_kill_process+0x221/0x3f0
 [<ffffffff811b3a22>] out_of_memory+0x422/0x560
 [<ffffffff811b9f69>] __alloc_pages_nodemask+0x1069/0x10c0
 [<ffffffff81211a41>] ? alloc_pages_vma+0xc1/0x300
 [<ffffffff81211a41>] alloc_pages_vma+0xc1/0x300
 [<ffffffff811e851f>] ? wp_page_copy+0x7f/0x640
 [<ffffffff811e851f>] wp_page_copy+0x7f/0x640
 [<ffffffff811e974b>] do_wp_page+0x13b/0x6e0
 [<ffffffff811ec704>] handle_mm_fault+0xaf4/0x1310
 [<ffffffff811ebc4b>] ? handle_mm_fault+0x3b/0x1310
 [<ffffffff8106eb90>] ? __do_page_fault+0x160/0x4e0
 [<ffffffff8106ec19>] __do_page_fault+0x1e9/0x4e0
 [<ffffffff8106efed>] trace_do_page_fault+0x5d/0x290
 [<ffffffff810674ca>] do_async_page_fault+0x1a/0xa0
 [<ffffffff8185bee8>] async_page_fault+0x28/0x30
 [<ffffffff810a73d3>] ? __task_pid_nr_ns+0xb3/0x1b0
 [<ffffffff8143ab9c>] ? __put_user_4+0x1c/0x30
 [<ffffffff810b7205>] ? schedule_tail+0x55/0x70
 [<ffffffff81859f3c>] ret_from_fork+0xc/0x40
Mem-Info:
active_anon:26762 inactive_anon:95 isolated_anon:0
 active_file:42543 inactive_file:347438 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:5476 slab_unreclaimable:23140
 mapped:389534 shmem:95 pagetables:20927 bounce:0
 free:6948 free_pcp:222 free_cma:0
Node 0 active_anon:107048kB inactive_anon:380kB active_file:170008kB inactive_file:1389752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1558136kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:4697206 all_unreclaimable? yes
Node 0 DMA free:2168kB min:204kB low:252kB high:300kB active_anon:3544kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2684kB kernel_stack:1760kB pagetables:3092kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 493 493 1955
Node 0 DMA32 free:6508kB min:6492kB low:8112kB high:9732kB active_anon:81264kB inactive_anon:0kB active_file:101204kB inactive_file:228kB unevictable:0kB writepending:0kB present:2080632kB managed:508584kB mlocked:0kB slab_reclaimable:21904kB slab_unreclaimable:89876kB kernel_stack:46400kB pagetables:80616kB bounce:0kB free_pcp:544kB local_pcp:120kB free_cma:0kB
lowmem_reserve[]: 0 0 0 1462
Node 0 Movable free:19116kB min:19256kB low:24068kB high:28880kB active_anon:22240kB inactive_anon:380kB active_file:68812kB inactive_file:1389688kB unevictable:0kB writepending:0kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:368kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 1*32kB (M) 1*64kB (U) 0*128kB 0*256kB 2*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 2168kB
Node 0 DMA32: 51*4kB (UME) 96*8kB (ME) 46*16kB (UME) 41*32kB (ME) 32*64kB (ME) 11*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6476kB
Node 0 Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 1*64kB (M) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 1*2048kB (M) 4*4096kB (M) = 19324kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
390134 total pagecache pages
0 pages in swap cache


> 
> Optionally please test without patch 5 if an OOM occurs.

Here is the OOM message without patch 5.

fork invoked oom-killer: gfp_mask=0x26000c0(GFP_KERNEL|__GFP_NOTRACK), order=0, oom_score_adj=0
fork cpuset=/ mems_allowed=0
CPU: 5 PID: 1269 Comm: fork Not tainted 4.7.0-rc7-next-20160720+ #714
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff8800136138e8 ffffffff8142bd23 ffff880013613ae0
 ffff88000fa6ca00 ffff880013613960 ffffffff81241a79 ffffffff81e70020
 ffff880013613910 ffffffff810dddcd ffff880013613930 0000000000000206
Call Trace:
 [<ffffffff8142bd23>] dump_stack+0x85/0xc2
 [<ffffffff81241a79>] dump_header+0x5c/0x22e
 [<ffffffff810dddcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811b33e1>] oom_kill_process+0x221/0x3f0
 [<ffffffff811b3a22>] out_of_memory+0x422/0x560
 [<ffffffff811b9f69>] __alloc_pages_nodemask+0x1069/0x10c0
 [<ffffffff8120fb01>] ? alloc_pages_current+0xa1/0x1f0
 [<ffffffff8120fb01>] alloc_pages_current+0xa1/0x1f0
 [<ffffffff81219f33>] ? new_slab+0x473/0x5e0
 [<ffffffff81219f33>] new_slab+0x473/0x5e0
 [<ffffffff8121b16f>] ___slab_alloc+0x27f/0x550
 [<ffffffff8121b491>] ? __slab_alloc+0x51/0x90
 [<ffffffff81081e11>] ? copy_process.part.29+0xc11/0x1b90
 [<ffffffff81081e11>] ? copy_process.part.29+0xc11/0x1b90
 [<ffffffff8121b491>] __slab_alloc+0x51/0x90
 [<ffffffff8121b6dc>] kmem_cache_alloc+0x20c/0x2b0
 [<ffffffff81081e11>] ? copy_process.part.29+0xc11/0x1b90
 [<ffffffff81081e11>] copy_process.part.29+0xc11/0x1b90
 [<ffffffff81082f86>] _do_fork+0xe6/0x6a0
 [<ffffffff810835e9>] SyS_clone+0x19/0x20
 [<ffffffff81003e13>] do_syscall_64+0x73/0x1e0
 [<ffffffff81859dc3>] entry_SYSCALL64_slow_path+0x25/0x25
Mem-Info:
active_anon:26003 inactive_anon:95 isolated_anon:0
 active_file:289026 inactive_file:96101 isolated_file:21
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:6056 slab_unreclaimable:23737
 mapped:384788 shmem:95 pagetables:23282 bounce:0
 free:7815 free_pcp:179 free_cma:0
Node 0 active_anon:104012kB inactive_anon:380kB active_file:1156104kB inactive_file:384404kB unevictable:0kB isolated(anon):0kB isolated(file):84kB mapped:1539152kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 2048kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:2512936 all_unreclaimable? yes
Node 0 DMA free:2172kB min:204kB low:252kB high:300kB active_anon:3204kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:16kB slab_unreclaimable:2944kB kernel_stack:1584kB pagetables:3188kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 493 493 1955
Node 0 DMA32 free:6320kB min:6492kB low:8112kB high:9732kB active_anon:79128kB inactive_anon:0kB active_file:69016kB inactive_file:15872kB unevictable:0kB writepending:0kB present:2080632kB managed:508584kB mlocked:0kB slab_reclaimable:24208kB slab_unreclaimable:92004kB kernel_stack:44064kB pagetables:89940kB bounce:0kB free_pcp:264kB local_pcp:100kB free_cma:0kB
lowmem_reserve[]: 0 0 0 1462
Node 0 Movable free:22768kB min:19256kB low:24068kB high:28880kB active_anon:21676kB inactive_anon:380kB active_file:1085592kB inactive_file:369724kB unevictable:0kB writepending:0kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:452kB local_pcp:80kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB (M) 0*8kB 1*16kB (M) 1*32kB (M) 1*64kB (M) 0*128kB 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 0*4096kB = 2172kB
Node 0 DMA32: 94*4kB (ME) 48*8kB (ME) 22*16kB (ME) 10*32kB (UME) 3*64kB (ME) 1*128kB (M) 0*256kB 2*512kB (UM) 4*1024kB (M) 0*2048kB 0*4096kB = 6872kB
Node 0 Movable: 0*4kB 0*8kB 1*16kB (M) 3*32kB (M) 4*64kB (M) 1*128kB (M) 10*256kB (M) 3*512kB (M) 0*1024kB 1*2048kB (M) 4*4096kB (M) = 23024kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
385234 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0

Thanks.

>  include/linux/mm_inline.h | 19 ++---------
>  include/linux/mmzone.h    |  7 ++++
>  include/linux/swap.h      |  1 +
>  mm/compaction.c           | 20 +----------
>  mm/migrate.c              |  2 ++
>  mm/page-writeback.c       | 17 +++++-----
>  mm/page_alloc.c           | 59 +++++++++++----------------------
>  mm/vmscan.c               | 84 ++++++++++++++++++++++++++++++++++++++---------
>  mm/vmstat.c               |  6 ++++
>  9 files changed, 116 insertions(+), 99 deletions(-)
> 
> -- 
> 2.6.4
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan
  2016-07-21 14:11   ` Mel Gorman
@ 2016-07-26  8:16     ` Joonsoo Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Joonsoo Kim @ 2016-07-26  8:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that to
> the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> an OOM kill.
> 
> This patch accounts for skipped pages as a partial scan so that an
> unreclaimable pgdat will still be marked as such but by scaling the cost
> of a skip, it'll avoid the pgdat being marked prematurely.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6810d81f60c7..e5af357dd4ac 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	LIST_HEAD(pages_skipped);
>  
>  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> -					!list_empty(src); scan++) {
> +					!list_empty(src);) {
>  		struct page *page;
>  
>  		page = lru_to_page(src);
> @@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>  
> +		/*
> +		 * Account for scanned and skipped separately to avoid the pgdat
> +		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> +		 */
> +		scan++;
> +

This logic has a potential unbounded retry problem: src would not become
empty if __isolate_lru_page() returns -EBUSY, since we move the failed
page back to the src list in that case.
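
For reference, the path in question looks roughly like this in the
isolate_lru_pages() switch quoted below (a sketch of the code at the
time):

	case -EBUSY:
		/* else it is being freed elsewhere */
		list_move(&page->lru, src);
		continue;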

Thanks.

>  		switch (__isolate_lru_page(page, mode)) {
>  		case 0:
>  			nr_pages = hpage_nr_pages(page);
> @@ -1465,14 +1471,24 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	 */
>  	if (!list_empty(&pages_skipped)) {
>  		int zid;
> +		unsigned long total_skipped = 0;
>  
> -		list_splice(&pages_skipped, src);
>  		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  			if (!nr_skipped[zid])
>  				continue;
>  
>  			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
> +			total_skipped += nr_skipped[zid];
>  		}
> +
> +		/*
> +		 * Account skipped pages as a partial scan as the pgdat may be
> +		 * close to unreclaimable. If the LRU list is empty, account
> +		 * skipped pages as a full scan.
> +		 */
> +		scan += list_empty(src) ? total_skipped : total_skipped >> 2;
> +
> +		list_splice(&pages_skipped, src);
>  	}
>  	*nr_scanned = scan;
>  	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
> -- 
> 2.6.4
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan
  2016-07-26  8:16     ` Joonsoo Kim
@ 2016-07-26  8:26       ` Joonsoo Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Joonsoo Kim @ 2016-07-26  8:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Tue, Jul 26, 2016 at 05:16:22PM +0900, Joonsoo Kim wrote:
> On Thu, Jul 21, 2016 at 03:11:01PM +0100, Mel Gorman wrote:
> > Page reclaim determines whether a pgdat is unreclaimable by examining how
> > many pages have been scanned since a page was freed and comparing that to
> > the LRU sizes. Skipped pages are not reclaim candidates but contribute to
> > scanned. This can prematurely mark a pgdat as unreclaimable and trigger
> > an OOM kill.
> > 
> > This patch accounts for skipped pages as a partial scan so that an
> > unreclaimable pgdat will still be marked as such but by scaling the cost
> > of a skip, it'll avoid the pgdat being marked prematurely.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  mm/vmscan.c | 20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6810d81f60c7..e5af357dd4ac 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1424,7 +1424,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  	LIST_HEAD(pages_skipped);
> >  
> >  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> > -					!list_empty(src); scan++) {
> > +					!list_empty(src);) {
> >  		struct page *page;
> >  
> >  		page = lru_to_page(src);
> > @@ -1438,6 +1438,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  			continue;
> >  		}
> >  
> > +		/*
> > +		 * Account for scanned and skipped separately to avoid the pgdat
> > +		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> > +		 */
> > +		scan++;
> > +
> 
> This logic has a potential unbounded retry problem: src would not become
> empty if __isolate_lru_page() returns -EBUSY, since we move the failed
> page back to the src list in that case.

Oops, it would not retry unboundedly. It would cause needless retries,
but bounded ones. Sorry for the noise.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2
  2016-07-26  8:11   ` Joonsoo Kim
@ 2016-07-26 12:50     ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-26 12:50 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Tue, Jul 26, 2016 at 05:11:30PM +0900, Joonsoo Kim wrote:
> > These patches did not OOM for me on a 2G 32-bit KVM instance while running
> > a stress test for an hour. Preliminary tests on a 64-bit system using a
> > parallel dd workload did not show anything alarming.
> > 
> > If an OOM is detected then please post the full OOM message.
> 
> Before attaching the OOM message, I should note that my test case also
> triggers an OOM on the old kernel if there are four parallel file readers.
> With node-lru and patches 1~5, an OOM is triggered even with a single
> parallel file reader. With node-lru and patches 1~4, an OOM is triggered
> with two or more parallel file readers.
> 

The key there is that patch 5 allows OOM to be detected more quickly. The
fork workload exits after some time, so it's inherently a race to see
whether the forked processes exit before OOM is triggered.

> <SNIP>
> Mem-Info:
> active_anon:26762 inactive_anon:95 isolated_anon:0
>  active_file:42543 inactive_file:347438 isolated_file:0
>  unevictable:0 dirty:0 writeback:0 unstable:0
>  slab_reclaimable:5476 slab_unreclaimable:23140
>  mapped:389534 shmem:95 pagetables:20927 bounce:0
>  free:6948 free_pcp:222 free_cma:0
> Node 0 active_anon:107048kB inactive_anon:380kB active_file:170008kB inactive_file:1389752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1558136kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:4697206 all_unreclaimable? yes
> Node 0 DMA free:2168kB min:204kB low:252kB high:300kB active_anon:3544kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:2684kB kernel_stack:1760kB pagetables:3092kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 493 493 1955

Zone DMA is unusable

> Node 0 DMA32 free:6508kB min:6492kB low:8112kB high:9732kB active_anon:81264kB inactive_anon:0kB active_file:101204kB inactive_file:228kB unevictable:0kB writepending:0kB present:2080632kB managed:508584kB mlocked:0kB slab_reclaimable:21904kB slab_unreclaimable:89876kB kernel_stack:46400kB pagetables:80616kB bounce:0kB free_pcp:544kB local_pcp:120kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 1462

Zone DMA32 has reclaimable pages, but not very many, and they are active.
It's at the min watermark. The pgdat is unreclaimable, indicating that
scan counts are high, which implies that the active file pages are due to
genuine activations.

> Node 0 Movable free:19116kB min:19256kB low:24068kB high:28880kB active_anon:22240kB inactive_anon:380kB active_file:68812kB inactive_file:1389688kB unevictable:0kB writepending:0kB present:1535864kB managed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:368kB local_pcp:0kB free_cma:0kB

Zone Movable has reclaimable pages but it's at the min watermark and
scanning aggressively.

As the failing allocation can use all zones, this appears to be close
to a genuine OOM case. Whether the workload survives comes down to the
timing of when OOM is triggered and whether the forked processes exit
in time.

To some extent, it could be "addressed" by immediately reclaiming active
pages as they move to the inactive list, at the cost of distorting page age
for a workload that is genuinely close to OOM. That is similar to what
zone-lru ended up doing -- fast-reclaiming young pages from a zone.

> > Optionally please test without patch 5 if an OOM occurs.
> 
> Here goes without patch 5.
> 

That causes OOM detection to be delayed. Observations on the OOM message
without patch 5 are similar.

Do you mind trying the following? In the patch there is a line

scan += list_empty(src) ? total_skipped : total_skipped >> 2;

Try 

scan += list_empty(src) ? total_skipped : total_skipped >> 3;
scan += list_empty(src) ? total_skipped : total_skipped >> 4;
scan += total_skipped >> 4;

Each line slows the rate at which OOM is detected, but the effect will be
somewhat specific to your test case as it relies on fork exiting before
OOM fires.
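
(As a worked example: if 1024 pages are skipped and the LRU is not
empty, the >>2 variant accounts 256 of them as scanned, >>3 accounts
128 and >>4 accounts 64, so each variant takes longer to push
NR_PAGES_SCANNED past the pgdat_reclaimable threshold.)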

A hackier option, also related to the fact that fork is a major source
of the OOM triggering, is to increase the zone reserve. That would give
more space to the fork bomb while giving the file readers slightly less
memory to work with. Again, this simply alters OOM timing, because the
indications are that the stress workload is genuinely close to OOM.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08ae8b0ef5c5..cedc8113c7a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -201,9 +201,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 	 256,
 #endif
 #ifdef CONFIG_HIGHMEM
-	 32,
+	 8,
 #endif
-	 32,
+	 8,
 };
 
 EXPORT_SYMBOL(totalram_pages);

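(For scale: a lower zone's reserve is roughly the managed pages of the
zones above it divided by this ratio, so dropping the ratio from 32 to
8 makes the protected reserve about four times larger. The value 8 here
is only illustrative, not a tuned recommendation.)
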
-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2
@ 2016-07-26 12:50     ` Mel Gorman
  0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-26 12:50 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Tue, Jul 26, 2016 at 05:11:30PM +0900, Joonsoo Kim wrote:
> > These patches did not OOM for me on a 2G 32-bit KVM instance while running
> > a stress test for an hour. Preliminary tests on a 64-bit system using a
> > parallel dd workload did not show anything alarming.
> > 
> > If an OOM is detected then please post the full OOM message.
> 
> Before attaching OOM message, I should note that my test case also triggers
> OOM in old kernel if there are four parallel file-readers. With node-lru and
> patch 1~5, OOM is triggered even if there are one or more parallel file-readers.
> With node-lru and patch 1~4, OOM is triggered if there are two or more
> parallel file-readers.
> 

The key there is that patch 5 allows OOM to be detected quicker. The fork
workload exits after some time so it's inherently a race to see if the
forked process exits before OOM is triggered or not.

> <SNIP>
> Mem-Info:
> active_anon:26762 inactive_anon:95 isolated_anon:0
>  active_file:42543 inactive_file:347438 isolated_file:0
>  unevictable:0 dirty:0 writeback:0 unstable:0
>  slab_reclaimable:5476 slab_unreclaimable:23140
>  mapped:389534 shmem:95 pagetables:20927 bounce:0
>  free:6948 free_pcp:222 free_cma:0
> Node 0 active_anon:107048kB inactive_anon:380kB active_file:170008kB inactive_file:1389752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1558136kB dirty:0kB writeback:0kB shmem:0kB shmem_$
> hp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:4697206 all_unreclaimable? yes
> Node 0 DMA free:2168kB min:204kB low:252kB high:300kB active_anon:3544kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB sl$
> b_reclaimable:0kB slab_unreclaimable:2684kB kernel_stack:1760kB pagetables:3092kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 493 493 1955

Zone DMA is unusable

> Node 0 DMA32 free:6508kB min:6492kB low:8112kB high:9732kB active_anon:81264kB inactive_anon:0kB active_file:101204kB inactive_file:228kB unevictable:0kB writepending:0kB present:2080632kB managed:508584k$
>  mlocked:0kB slab_reclaimable:21904kB slab_unreclaimable:89876kB kernel_stack:46400kB pagetables:80616kB bounce:0kB free_pcp:544kB local_pcp:120kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 1462

Zone DMA32 has reclaimable pages but not very many and they are active. It's
at the min watemark. The pgdat is unreclaimable indicating that scans
are high which implies that the active file pages are due to genuine
activations.

> Node 0 Movable free:19116kB min:19256kB low:24068kB high:28880kB active_anon:22240kB inactive_anon:380kB active_file:68812kB inactive_file:1389688kB unevictable:0kB writepending:0kB present:1535864kB mana$
> ed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:368kB local_pcp:0kB free_cma:0kB

Zone Movable has reclaimable pages but it's at the min watermark and
scanning aggressively.

As the failing allocation can use all allocations, this appears to be close
to a genuine OOM case. Whether it survives is down to timing of when OOM
is triggered and whether the forked process exits in time or not.

To some extent, it could be "addressed" by immediately reclaiming active
pages moving to the inactive list at the cost of distorting page age for a
workload that is genuinely close to OOM. That is similar to what zone-lru
ended up doing -- fast reclaiming young pages from a zone.

> > Optionally please test without patch 5 if an OOM occurs.
> 
> Here goes without patch 5.
> 

Causing OOM detection to be delayed. Observations on the OOM message
without patch 5 are similar.

Do you mind trying the following? In the patch there is a line

scan += list_empty(src) ? total_skipped : total_skipped >> 2;

Try 

scan += list_empty(src) ? total_skipped : total_skipped >> 3;
scan += list_empty(src) ? total_skipped : total_skipped >> 4;
scan += total_skipped >> 4;

Each line slows the rate that OOM is detected but it'll be somewhat
specific to your test case as it's relying to fork to exit before OOM is
fired.

A hackier option, related to the fact that fork is a major source of the OOM
triggering, is to increase the lowmem reserve. That would give more space to
the fork bomb while giving the file readers slightly less memory to work
with. Again, this simply alters OOM timing because the indications are that
the stress workload is genuinely close to OOM.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08ae8b0ef5c5..cedc8113c7a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -201,9 +201,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 	 256,
 #endif
 #ifdef CONFIG_HIGHMEM
-	 32,
+	 8,
 #endif
-	 32,
+	 8,
 };
 
 EXPORT_SYMBOL(totalram_pages);
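
For a rough sense of scale, each lowmem_reserve[] entry works out to
approximately the managed pages of all higher zones divided by the ratio
(paraphrasing setup_per_zone_lowmem_reserve(); the numbers below are taken
from the report above purely as an illustration):

	DMA protecting against Movable-capable allocations:
	(508584kB + 1500964kB) / 4kB-per-page / 256 ~= 1962 pages

which is close to the reported lowmem_reserve[] value of 1955 for zone DMA.
Cutting a ratio from 32 to 8 therefore quadruples the reserve for the
entries it governs.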

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2
  2016-07-26 12:50     ` Mel Gorman
@ 2016-07-28  6:44       ` Joonsoo Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Joonsoo Kim @ 2016-07-28  6:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Tue, Jul 26, 2016 at 01:50:50PM +0100, Mel Gorman wrote:
> On Tue, Jul 26, 2016 at 05:11:30PM +0900, Joonsoo Kim wrote:
> > > These patches did not OOM for me on a 2G 32-bit KVM instance while running
> > > a stress test for an hour. Preliminary tests on a 64-bit system using a
> > > parallel dd workload did not show anything alarming.
> > > 
> > > If an OOM is detected then please post the full OOM message.
> > 
> > Before attaching the OOM message, I should note that my test case also
> > triggers OOM on the old kernel if there are four parallel file-readers.
> > With node-lru and patches 1~5, OOM is triggered with even one parallel
> > file-reader. With node-lru and patches 1~4, OOM is triggered with two or
> > more parallel file-readers.
> > 
> 
> The key there is that patch 5 allows OOM to be detected more quickly. The
> fork workload exits after some time, so it's inherently a race whether the
> forked processes exit before OOM is triggered.
> 
> > <SNIP>
> > Mem-Info:
> > active_anon:26762 inactive_anon:95 isolated_anon:0
> >  active_file:42543 inactive_file:347438 isolated_file:0
> >  unevictable:0 dirty:0 writeback:0 unstable:0
> >  slab_reclaimable:5476 slab_unreclaimable:23140
> >  mapped:389534 shmem:95 pagetables:20927 bounce:0
> >  free:6948 free_pcp:222 free_cma:0
> > Node 0 active_anon:107048kB inactive_anon:380kB active_file:170008kB inactive_file:1389752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1558136kB dirty:0kB writeback:0kB shmem:0kB shmem_$
> > hp: 0kB shmem_pmdmapped: 0kB anon_thp: 380kB writeback_tmp:0kB unstable:0kB pages_scanned:4697206 all_unreclaimable? yes
> > Node 0 DMA free:2168kB min:204kB low:252kB high:300kB active_anon:3544kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB sl$
> > b_reclaimable:0kB slab_unreclaimable:2684kB kernel_stack:1760kB pagetables:3092kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 493 493 1955
> 
> Zone DMA is unusable
> 
> > Node 0 DMA32 free:6508kB min:6492kB low:8112kB high:9732kB active_anon:81264kB inactive_anon:0kB active_file:101204kB inactive_file:228kB unevictable:0kB writepending:0kB present:2080632kB managed:508584k$
> >  mlocked:0kB slab_reclaimable:21904kB slab_unreclaimable:89876kB kernel_stack:46400kB pagetables:80616kB bounce:0kB free_pcp:544kB local_pcp:120kB free_cma:0kB
> > lowmem_reserve[]: 0 0 0 1462
> 
> Zone DMA32 has reclaimable pages but not very many, and they are active. It's
> right at the min watermark (free 6508kB vs min 6492kB). The pgdat is marked
> unreclaimable, indicating that scan counts are high, which implies that the
> active file pages are due to genuine activations.
> 
> > Node 0 Movable free:19116kB min:19256kB low:24068kB high:28880kB active_anon:22240kB inactive_anon:380kB active_file:68812kB inactive_file:1389688kB unevictable:0kB writepending:0kB present:1535864kB mana$
> > ed:1500964kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:368kB local_pcp:0kB free_cma:0kB
> 
> Zone Movable has reclaimable pages but it is at (in fact slightly below) the
> min watermark and is being scanned aggressively.
> 
> As the failing allocation can use all zones, this appears to be close to a
> genuine OOM case. Whether the system survives comes down to the timing of
> when OOM is triggered and whether the forked processes exit in time.
>
> To some extent, it could be "addressed" by immediately reclaiming active
> pages as they move to the inactive list, at the cost of distorting page age
> for a workload that is genuinely close to OOM. That is similar to what
> zone-lru ended up doing -- fast-reclaiming young pages from a zone.

My expectation for my test case is that the reclaimers should kick out
actively used pages and make room for 'fork', because the parallel readers
would still work even if the pages they read are not cached.

It is sensitive to reclaim efficiency because the parallel readers read
pages repeatedly and disturb reclaim. I thought it was a good test for
node-lru, which changes reclaim efficiency for the lower zones. However, as
you said, this efficiency comes at the cost of distorting page aging, so now
I'm not sure it is a problem we need to consider. Shall we skip it?
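
A minimal sketch of the shape of that workload, purely as an illustration --
this is not the actual test case, and the file name, reader count and buffer
size are assumptions:

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

static void reader(const char *path)
{
	static char buf[1 << 16];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		_exit(1);
	for (;;) {
		/* re-read the file forever so its pages stay active */
		if (lseek(fd, 0, SEEK_SET) < 0)
			_exit(1);
		while (read(fd, buf, sizeof(buf)) > 0)
			;
	}
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "bigfile";
	int i;

	/* parallel file readers that fight reclaim */
	for (i = 0; i < 4; i++)
		if (fork() == 0)
			reader(path);

	/* short-lived forks pressure lowmem (pagetables, kernel stacks) */
	for (;;) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
}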

Anyway, thanks for tracking down the problem.


> 
> > > Optionally please test without patch 5 if an OOM occurs.
> > 
> > Here goes without patch 5.
> > 
> 
> That causes OOM detection to be delayed. Observations on the OOM message
> without patch 5 are similar.
> 
> Do you mind trying the following? In the patch there is a line
> 
> scan += list_empty(src) ? total_skipped : total_skipped >> 2;
> 
> Try each of the following, one at a time:
> 
> scan += list_empty(src) ? total_skipped : total_skipped >> 3;
> scan += list_empty(src) ? total_skipped : total_skipped >> 4;
> scan += total_skipped >> 4;

Tested, but none of the results show much difference.

> 
> Each successive line slows the rate at which OOM is detected, but the effect
> will be somewhat specific to your test case as it relies on the forked
> processes exiting before OOM fires.

Okay. I don't think optimizing general code for my specific test case
is a good idea.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2
  2016-07-28  6:44       ` Joonsoo Kim
@ 2016-07-28 10:27         ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2016-07-28 10:27 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Thu, Jul 28, 2016 at 03:44:33PM +0900, Joonsoo Kim wrote:
> > To some extent, it could be "addressed" by immediately reclaiming active
> > pages as they move to the inactive list, at the cost of distorting page age
> > for a workload that is genuinely close to OOM. That is similar to what
> > zone-lru ended up doing -- fast-reclaiming young pages from a zone.
> 
> My expectation for my test case is that the reclaimers should kick out
> actively used pages and make room for 'fork', because the parallel readers
> would still work even if the pages they read are not cached.
> 
> It is sensitive to reclaim efficiency because the parallel readers read
> pages repeatedly and disturb reclaim. I thought it was a good test for
> node-lru, which changes reclaim efficiency for the lower zones. However, as
> you said, this efficiency comes at the cost of distorting page aging, so now
> I'm not sure it is a problem we need to consider. Shall we skip it?
> 

I think we should skip it for now. The alterations are too specific to a
test case that is very close to being genuinely OOM. Adjusting the timing
for one OOM case may just lead to complaints that OOM is detected too
slowly in others.

> Anyway, thanks for tracking down the problem.
> 

My pleasure, thanks to both you and Minchan for persisting with this as
we got some important fixes out of the discussion.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

Thread overview: 44+ messages

2016-07-21 14:10 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2 Mel Gorman
2016-07-21 14:10 ` [PATCH 1/5] mm: add per-zone lru list stat Mel Gorman
2016-07-22 15:51   ` Johannes Weiner
2016-07-21 14:10 ` [PATCH 2/5] mm, vmscan: Remove highmem_file_pages Mel Gorman
2016-07-22 15:53   ` Johannes Weiner
2016-07-25  8:09   ` Minchan Kim
2016-07-25  9:23     ` [PATCH] mm, vmscan: remove highmem_file_pages -fix Mel Gorman
2016-07-21 14:10 ` [PATCH 3/5] mm: Remove reclaim and compaction retry approximations Mel Gorman
2016-07-22 15:57   ` Johannes Weiner
2016-07-25  8:18   ` Minchan Kim
2016-07-21 14:11 ` [PATCH 4/5] mm: consider per-zone inactive ratio to deactivate Mel Gorman
2016-07-21 15:52   ` Johannes Weiner
2016-07-21 14:11 ` [PATCH 5/5] mm, vmscan: Account for skipped pages as a partial scan Mel Gorman
2016-07-22 16:02   ` Johannes Weiner
2016-07-25  8:39   ` Minchan Kim
2016-07-25  9:52     ` Mel Gorman
2016-07-26  8:16   ` Joonsoo Kim
2016-07-26  8:26     ` Joonsoo Kim
2016-07-26  8:11 ` [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v2 Joonsoo Kim
2016-07-26 12:50   ` Mel Gorman
2016-07-28  6:44     ` Joonsoo Kim
2016-07-28 10:27       ` Mel Gorman
