linux-mm.kvack.org archive mirror
* [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1
@ 2016-07-20 15:21 Mel Gorman
  2016-07-20 15:21 ` [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-20 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

Both Joonsoo Kim and Minchan Kim have reported premature OOM kills on
a 32-bit platform. The common element is a zone-constrained high-order
allocation failing. Two factors appear to be at fault -- pgdat being
considered unreclaimable prematurely and insufficient rotation of the
active list.

Unfortunately to date I have been unable to reproduce this with a variety
of stress workloads on a 2G 32-bit KVM instance. It's not clear why as
the steps are similar to what was described. It means I've been unable to
determine if this series addresses the problem or not. I'm hoping they can
test and report back before these are merged to mmotm. What I have checked
is that a basic parallel DD workload completed successfully on the same
machine I used for the node-lru performance tests. I'll leave the other
tests running just in case anything interesting falls out.

The series is in three basic parts;

Patch 1 does not account for skipped pages as scanned. This avoids the pgdat
	being prematurely marked unreclaimable

Patches 2-4 add per-zone stats back in. The actual stats patch is different
	to Minchan's as the original patch did not account for unevictable
	LRU which would corrupt counters. The second two patches remove
	approximations based on pgdat statistics. It's effectively a
	revert of "mm, vmstat: remove zone and node double accounting by
	approximating retries" but different LRU stats are used. This
	is better than a full revert or a reworking of the series as
	it preserves history of why the zone stats are necessary.

	If this works out, we may have to leave the double accounting in
	place for now until an alternative cheap solution presents itself.

Patch 5 rotates inactive/active lists for lowmem allocations. This is also
	quite different to Minchan's patch as the original patch did not
	account for memcg and would rotate if *any* eligible zone needed
	rotation which may rotate excessively. The new patch considers
	the ratio for all eligible zones which is more in line with
	node-lru in general.

 include/linux/mm_inline.h | 19 ++-------------
 include/linux/mmzone.h    |  7 ++++++
 include/linux/swap.h      |  1 +
 mm/compaction.c           | 20 +---------------
 mm/migrate.c              |  2 ++
 mm/page-writeback.c       | 17 +++++++-------
 mm/page_alloc.c           | 59 ++++++++++++++++------------------------------
 mm/vmscan.c               | 60 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmstat.c               |  6 +++++
 9 files changed, 102 insertions(+), 89 deletions(-)

-- 
2.6.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
@ 2016-07-20 15:21 ` Mel Gorman
  2016-07-21  5:16   ` Minchan Kim
  2016-07-25  8:04   ` Minchan Kim
  2016-07-20 15:21 ` [PATCH 2/5] mm: add per-zone lru list stat Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-20 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

Page reclaim determines whether a pgdat is unreclaimable by examining how
many pages have been scanned since a page was freed and comparing that
to the LRU sizes. Skipped pages are not considered reclaim candidates but
contribute to scanned. This can prematurely mark a pgdat as unreclaimable
and trigger an OOM kill.
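
For reference, the check this interacts with is roughly the following (a
sketch of pgdat_reclaimable() as it stands in the node-lru tree, shown for
illustration rather than as the exact code):

bool pgdat_reclaimable(struct pglist_data *pgdat)
{
	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
		pgdat_reclaimable_pages(pgdat) * 6;
}

The nr_scanned value returned by isolate_lru_pages() is what feeds
NR_PAGES_SCANNED for global reclaim, so counting skipped pages inflates the
left-hand side until the node is declared unreclaimable.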

While this does not fix an OOM kill message reported by Joonsoo Kim,
it did stop pgdat being marked unreclaimable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 22aec2bcfeec..b16d578ce556 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1415,7 +1415,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
-					!list_empty(src); scan++) {
+					!list_empty(src);) {
 		struct page *page;
 
 		page = lru_to_page(src);
@@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			continue;
 		}
 
+		/* Pages skipped do not contribute to scan */
+		scan++;
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
-- 
2.6.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/5] mm: add per-zone lru list stat
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
  2016-07-20 15:21 ` [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned Mel Gorman
@ 2016-07-20 15:21 ` Mel Gorman
  2016-07-21  7:10   ` Joonsoo Kim
  2016-07-20 15:21 ` [PATCH 3/5] mm, vmscan: Remove highmem_file_pages Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2016-07-20 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan@kernel.org>

While running a stress test with hackbench, I frequently got OOM messages,
which never happened with zone-lru.

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
..
..
 [<c71a76e2>] __alloc_pages_nodemask+0xe52/0xe60
 [<c71f31dc>] ? new_slab+0x39c/0x3b0
 [<c71f31dc>] new_slab+0x39c/0x3b0
 [<c71f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c777e127>] ? _raw_spin_unlock_irq+0x27/0x60
 [<c70cebfc>] ? trace_hardirqs_on_caller+0xec/0x1b0
 [<c70a1506>] ? finish_task_switch+0xa6/0x220
 [<c7219ee0>] ? poll_select_copy_remaining+0x140/0x140
 [<c7201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c71f525c>] kmem_cache_alloc+0x22c/0x260
 [<c763e6fc>] ? __alloc_skb+0x3c/0x260
 [<c763e6fc>] __alloc_skb+0x3c/0x260
 [<c763eece>] alloc_skb_with_frags+0x4e/0x1a0
 [<c7638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
 [<c770b581>] ? wait_for_unix_gc+0x31/0x90
 [<c71cfb1d>] ? alloc_set_pte+0x2ad/0x310
 [<c77084dd>] unix_stream_sendmsg+0x28d/0x340
 [<c7634dad>] sock_sendmsg+0x2d/0x40
 [<c7634e2c>] sock_write_iter+0x6c/0xc0
 [<c7204a90>] __vfs_write+0xc0/0x120
 [<c72053ab>] vfs_write+0x9b/0x1a0
 [<c71cc4a9>] ? __might_fault+0x49/0xa0
 [<c72062c4>] SyS_write+0x44/0x90
 [<c70036c6>] do_fast_syscall_32+0xa6/0x1e0
 [<c777ea2c>] sysenter_past_esp+0x45/0x74

Mem-Info:
active_anon:104698 inactive_anon:105791 isolated_anon:192
 active_file:433 inactive_file:283 isolated_file:22
 unevictable:0 dirty:0 writeback:296 unstable:0
 slab_reclaimable:6389 slab_unreclaimable:78927
 mapped:474 shmem:0 pagetables:101426 bounce:0
 free:10518 free_pcp:334 free_cma:0
Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
25121 total pagecache pages
24160 pages in swap cache
Swap cache stats: add 86371, delete 62211, find 42865/60187
Free swap  = 4015560kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9658 pages reserved
0 pages cma reserved

The order-0 allocation for the normal zone failed while there was a lot of
reclaimable memory (i.e., anonymous memory with free swap). I wanted to
analyze the problem but it was hard because the per-zone LRU stats had been
removed, so I couldn't tell how much anonymous memory was in the normal/DMA
zones.

When we investigate an OOM problem, the reclaimable memory count is a
crucial statistic for finding the cause. Without it, it's hard to parse the
OOM message, so I believe we should keep it.

With per-zone lru stat,

gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
Mem-Info:
active_anon:101103 inactive_anon:102219 isolated_anon:0
 active_file:503 inactive_file:544 isolated_file:0
 unevictable:0 dirty:0 writeback:34 unstable:0
 slab_reclaimable:6298 slab_unreclaimable:74669
 mapped:863 shmem:0 pagetables:100998 bounce:0
 free:23573 free_pcp:1861 free_cma:0
Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 809 1965 1965
Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
lowmem_reserve[]: 0 0 9247 9247
HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
54409 total pagecache pages
53215 pages in swap cache
Swap cache stats: add 300982, delete 247765, find 157978/226539
Free swap  = 3803244kB
Total swap = 4192252kB
524186 pages RAM
295934 pages HighMem/MovableOnly
9642 pages reserved
0 pages cma reserved

With that, we can see the normal zone has about 86M of reclaimable memory,
so we know something is going wrong in reclaim (I will fix the problem in
the next patch).
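
As an aside, these counters also make it easy to read a zone's LRU sizes
directly. A minimal helper along these lines (hypothetical, shown only to
illustrate how the new counters are indexed, not part of this patch):

static inline unsigned long zone_lru_size(struct zone *zone, enum lru_list lru)
{
	return zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
}

For example, zone_lru_size(zone, LRU_ACTIVE_ANON) is the value that
show_free_areas() prints as active_anon for the zone in the output above.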

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h |  2 ++
 include/linux/mmzone.h    |  6 ++++++
 mm/page_alloc.c           | 10 ++++++++++
 mm/vmstat.c               |  5 +++++
 4 files changed, 23 insertions(+)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index bcc4ed07fa90..9cc130f5feb2 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -45,6 +45,8 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
+	__mod_zone_page_state(&pgdat->node_zones[zid],
+				NR_ZONE_LRU_BASE + lru, nr_pages);
 	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e6aca07cedb7..72625b04e9ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,6 +110,12 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
+	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
+	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
+	NR_ZONE_ACTIVE_ANON,
+	NR_ZONE_INACTIVE_FILE,
+	NR_ZONE_ACTIVE_FILE,
+	NR_ZONE_UNEVICTABLE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 830ad49a584a..b44c9a8d879a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4388,6 +4388,11 @@ void show_free_areas(unsigned int filter)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
+			" unevictable:%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4405,6 +4410,11 @@ void show_free_areas(unsigned int filter)
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
+			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 91ecca96dcae..f10aad81a9a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
+	"nr_unevictable",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] mm, vmscan: Remove highmem_file_pages
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
  2016-07-20 15:21 ` [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned Mel Gorman
  2016-07-20 15:21 ` [PATCH 2/5] mm: add per-zone lru list stat Mel Gorman
@ 2016-07-20 15:21 ` Mel Gorman
  2016-07-20 15:21 ` [PATCH 4/5] mm: Remove reclaim and compaction retry approximations Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-20 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

With the reintroduction of per-zone LRU stats, highmem_file_pages is
redundant so remove it.
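
The counter is redundant because the same information can now be derived
from the per-zone LRU stats added by the previous patch. Roughly, the sum
it tracked is equivalent to the following (an illustrative sketch only; the
real change is in the diff below):

unsigned long highmem_file = 0;
int node, i;

for_each_node_state(node, N_HIGH_MEMORY) {
	for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
		struct zone *z = &NODE_DATA(node)->node_zones[i];

		if (!is_highmem(z) || !populated_zone(z))
			continue;
		highmem_file += zone_page_state(z, NR_ZONE_INACTIVE_FILE) +
				zone_page_state(z, NR_ZONE_ACTIVE_FILE);
	}
}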

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h | 17 -----------------
 mm/page-writeback.c       | 12 ++++--------
 2 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9cc130f5feb2..71613e8a720f 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,22 +4,6 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
 
-#ifdef CONFIG_HIGHMEM
-extern atomic_t highmem_file_pages;
-
-static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
-							int nr_pages)
-{
-	if (is_highmem_idx(zid) && is_file_lru(lru))
-		atomic_add(nr_pages, &highmem_file_pages);
-}
-#else
-static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
-							int nr_pages)
-{
-}
-#endif
-
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -47,7 +31,6 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
-	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 573d138fa7a5..cfa78124c3c2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,17 +299,13 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	return nr_pages;
 }
-#ifdef CONFIG_HIGHMEM
-atomic_t highmem_file_pages;
-#endif
 
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
 	int node;
-	unsigned long x;
+	unsigned long x = 0;
 	int i;
-	unsigned long dirtyable = 0;
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
 		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
@@ -326,12 +322,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 			nr_pages = zone_page_state(z, NR_FREE_PAGES);
 			/* watch for underflows */
 			nr_pages -= min(nr_pages, high_wmark_pages(z));
-			dirtyable += nr_pages;
+			nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
+			nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
+			x += nr_pages;
 		}
 	}
 
-	x = dirtyable + atomic_read(&highmem_file_pages);
-
 	/*
 	 * Unreclaimable memory (kernel memory or anonymous memory
 	 * without swap) can bring down the dirtyable pages below
-- 
2.6.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] mm: Remove reclaim and compaction retry approximations
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
                   ` (2 preceding siblings ...)
  2016-07-20 15:21 ` [PATCH 3/5] mm, vmscan: Remove highmem_file_pages Mel Gorman
@ 2016-07-20 15:21 ` Mel Gorman
  2016-07-20 15:21 ` [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-20 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics. This is effectively a revert of "mm, vmstat: remove zone and
node double accounting by approximating retries" with the difference that
inactive/active stats are still available. This preserves the history of
why the approximation was tried and why it had to be reverted to handle
OOM kills on 32-bit systems.
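
To make the restored per-zone estimate concrete, the calculation in
should_reclaim_retry() after this patch boils down to the following (a
sketch, not the exact code):

reclaimable = zone_reclaimable_pages(zone);
available = reclaimable - DIV_ROUND_UP(no_progress_loops * reclaimable,
				       MAX_RECLAIM_RETRIES);
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

For example, with MAX_RECLAIM_RETRIES at 16 and no_progress_loops at 4, a
quarter of the zone's reclaimable estimate is discounted before the
watermark check decides whether another retry is worthwhile.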

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  1 +
 include/linux/swap.h   |  1 +
 mm/compaction.c        | 20 +-------------------
 mm/migrate.c           |  2 ++
 mm/page-writeback.c    |  5 +++++
 mm/page_alloc.c        | 49 ++++++++++---------------------------------------
 mm/vmscan.c            | 18 ++++++++++++++++++
 mm/vmstat.c            |  1 +
 8 files changed, 39 insertions(+), 58 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 72625b04e9ba..f2e4e90621ec 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index cc753c639e3d..b17cc4830fa6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,6 +307,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index cd93ea24c565..e5995f38d677 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1438,11 +1438,6 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
 	struct zone *zone;
 	struct zoneref *z;
-	pg_data_t *last_pgdat = NULL;
-
-	/* Do not retry compaction for zone-constrained allocations */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		return false;
 
 	/*
 	 * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1453,27 +1448,14 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		unsigned long available;
 		enum compact_result compact_result;
 
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-
-		/*
-		 * This over-estimates the number of pages available for
-		 * reclaim/compaction but walking the LRU would take too
-		 * long. The consequences are that compaction may retry
-		 * longer than it should for a zone-constrained allocation
-		 * request.
-		 */
-		last_pgdat = zone->zone_pgdat;
-		available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
-
 		/*
 		 * Do not consider all the reclaimable memory because we do not
 		 * want to trash just for a single high order allocation which
 		 * is even not guaranteed to appear even if __compaction_suitable
 		 * is happy about the watermark check.
 		 */
+		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
-		available = min(zone->managed_pages, available);
 		compact_result = __compaction_suitable(zone, order, alloc_flags,
 				ac_classzone_idx(ac), available);
 		if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index ed2f85e61de1..ed0268268e93 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,7 +513,9 @@ int migrate_page_move_mapping(struct address_space *mapping,
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
 			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
 			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+			__inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cfa78124c3c2..7e9061ec040b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2462,6 +2462,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
+		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2483,6 +2484,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		dec_node_page_state(page, NR_FILE_DIRTY);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2739,6 +2741,7 @@ int clear_page_dirty_for_io(struct page *page)
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 			dec_node_page_state(page, NR_FILE_DIRTY);
+			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2785,6 +2788,7 @@ int test_clear_page_writeback(struct page *page)
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2839,6 +2843,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		inc_node_page_state(page, NR_WRITEBACK);
+		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b44c9a8d879a..afb254e22235 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3434,7 +3434,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
-	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3444,15 +3443,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
-	 * Blindly retry lowmem allocation requests that are often ignored by
-	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
-	 * and fast means of calculating reclaimable, dirty and writeback pages
-	 * in eligible zones.
-	 */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		goto out;
-
-	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3462,38 +3452,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
-		int zid;
 
-		if (current_pgdat == zone->zone_pgdat)
-			continue;
-
-		current_pgdat = zone->zone_pgdat;
-		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
+		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-
-		/* Account for all free pages on eligible zones */
-		for (zid = 0; zid <= zone_idx(zone); zid++) {
-			struct zone *acct_zone = &current_pgdat->node_zones[zid];
-
-			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
-		}
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
-		 * available? This is approximate because there is no
-		 * accurate count of reclaimable pages per zone.
+		 * available?
 		 */
-		for (zid = 0; zid <= zone_idx(zone); zid++) {
-			struct zone *check_zone = &current_pgdat->node_zones[zid];
-			unsigned long estimate;
-
-			estimate = min(check_zone->managed_pages, available);
-			if (!__zone_watermark_ok(check_zone, order,
-					min_wmark_pages(check_zone), ac_classzone_idx(ac),
-					alloc_flags, estimate))
-				continue;
-
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac_classzone_idx(ac), alloc_flags, available)) {
 			/*
 			 * If we didn't make any progress and have a lot of
 			 * dirty + writeback pages then we should wait for
@@ -3503,16 +3473,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			if (!did_some_progress) {
 				unsigned long write_pending;
 
-				write_pending =
-					node_page_state(current_pgdat, NR_WRITEBACK) +
-					node_page_state(current_pgdat, NR_FILE_DIRTY);
+				write_pending = zone_page_state_snapshot(zone,
+							NR_ZONE_WRITE_PENDING);
 
 				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
 			}
-out:
+
 			/*
 			 * Memory allocation/reclaim might be called from a WQ
 			 * context and the current implementation of the WQ
@@ -4393,6 +4362,7 @@ void show_free_areas(unsigned int filter)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
+			" writepending:%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4415,6 +4385,7 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
+			K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b16d578ce556..8f5959469079 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,6 +194,24 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
+/*
+ * This misses isolated pages which are not accounted for to save counters.
+ * As the data only determines if reclaim or compaction continues, it is
+ * not expected that isolated pages will be a dominating factor.
+ */
+unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+	unsigned long nr;
+
+	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
+		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
+			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
+
+	return nr;
+}
+
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f10aad81a9a3..e1a46906c61b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,7 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
+	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
                   ` (3 preceding siblings ...)
  2016-07-20 15:21 ` [PATCH 4/5] mm: Remove reclaim and compaction retry approximations Mel Gorman
@ 2016-07-20 15:21 ` Mel Gorman
  2016-07-21  5:30   ` Minchan Kim
  2016-07-21  7:10   ` Joonsoo Kim
  2016-07-21  7:07 ` [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Minchan Kim
  2016-07-21  7:31 ` Joonsoo Kim
  6 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-20 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Minchan Kim, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML, Mel Gorman

From: Minchan Kim <minchan@kernel.org>

Minchan Kim reported that with per-zone lru state it was possible to
identify that a normal zone with 86M of anonymous pages could trigger
OOM with non-atomic order-0 allocations as all pages in the zone
were in the active list.

   gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
   Call Trace:
    [<c51a76e2>] __alloc_pages_nodemask+0xe52/0xe60
    [<c51f31dc>] ? new_slab+0x39c/0x3b0
    [<c51f31dc>] new_slab+0x39c/0x3b0
    [<c51f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
    [<c563e6fc>] ? __alloc_skb+0x3c/0x260
    [<c50b8e93>] ? enqueue_task_fair+0x73/0xbf0
    [<c5219ee0>] ? poll_select_copy_remaining+0x140/0x140
    [<c5201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
    [<c563e6fc>] ? __alloc_skb+0x3c/0x260
    [<c51f525c>] kmem_cache_alloc+0x22c/0x260
    [<c563e6fc>] ? __alloc_skb+0x3c/0x260
    [<c563e6fc>] __alloc_skb+0x3c/0x260
    [<c563eece>] alloc_skb_with_frags+0x4e/0x1a0
    [<c5638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
    [<c570b581>] ? wait_for_unix_gc+0x31/0x90
    [<c57084dd>] unix_stream_sendmsg+0x28d/0x340
    [<c5634dad>] sock_sendmsg+0x2d/0x40
    [<c5634e2c>] sock_write_iter+0x6c/0xc0
    [<c5204a90>] __vfs_write+0xc0/0x120
    [<c52053ab>] vfs_write+0x9b/0x1a0
    [<c51cc4a9>] ? __might_fault+0x49/0xa0
    [<c52062c4>] SyS_write+0x44/0x90
    [<c50036c6>] do_fast_syscall_32+0xa6/0x1e0

   Mem-Info:
   active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
   Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
   DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
   lowmem_reserve[]: 0 809 1965 1965
   Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
   lowmem_reserve[]: 0 0 9247 9247
   HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
   lowmem_reserve[]: 0 0 0 0
   DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
   Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
   HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
   54409 total pagecache pages
   53215 pages in swap cache
   Swap cache stats: add 300982, delete 247765, find 157978/226539
   Free swap  = 3803244kB
   Total swap = 4192252kB
   524186 pages RAM
   295934 pages HighMem/MovableOnly
   9642 pages reserved
   0 pages cma reserved

The problem is due to the active deactivation logic in inactive_list_is_low.

	Node 0 active_anon:404412kB inactive_anon:409040kB

IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due to
highmem anonymous stat so VM never deactivates normal zone's anonymous pages.
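
To spell out the arithmetic with the figures above, using the
inactive_ratio calculation visible in the patch below:

  node inactive_anon = 409040kB, node active_anon = 404412kB
  total anon ~= 794MB (< 1GB)  =>  gb = 0  =>  inactive_ratio = 1
  inactive * 1 >= active       =>  inactive_list_is_low() returns false

  Normal zone alone: active_anon = 86304kB, inactive_anon = 0kB

So the node-wide check never triggers deactivation even though the normal
zone's inactive anon list is essentially empty.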

This patch is a modified version of Minchan's original solution but based
upon it. The problem with Minchan's patch is that it didn't take memcg
into account and any low zone with an imbalanced list could force a rotation.

In this page, a zone-constrained global reclaim will rotate the list if
the inactive/active ratio of all eligible zones needs to be corrected. It
is possible that higher zone pages will be initially rotated prematurely
but this is the safer choice to maintain overall LRU age.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8f5959469079..dddf73f4293c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,7 +1976,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
+static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
+						struct scan_control *sc)
 {
 	unsigned long inactive_ratio;
 	unsigned long inactive;
@@ -1993,6 +1994,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
 	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
 	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
+	/*
+	 * For global reclaim on zone-constrained allocations, it is necessary
+	 * to check if rotations are required for lowmem to be reclaimed. This
+	 * calculates the inactive/active pages available in eligible zones.
+	 */
+	if (global_reclaim(sc)) {
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+		int zid;
+
+		for (zid = sc->reclaim_idx; zid < MAX_NR_ZONES; zid++) {
+			struct zone *zone = &pgdat->node_zones[zid];
+			unsigned long inactive_zone, active_zone;
+
+			if (!populated_zone(zone))
+				continue;
+
+			inactive_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE));
+			active_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
+
+			inactive -= min(inactive, inactive_zone);
+			active -= min(active, active_zone);
+		}
+	}
+
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
@@ -2006,7 +2033,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, is_file_lru(lru)))
+		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -2137,7 +2164,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * lruvec even if it has plenty of old anonymous pages unless the
 	 * system is under heavy pressure.
 	 */
-	if (!inactive_list_is_low(lruvec, true) &&
+	if (!inactive_list_is_low(lruvec, true, sc) &&
 	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
@@ -2379,7 +2406,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_list_is_low(lruvec, false))
+	if (inactive_list_is_low(lruvec, false, sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 
@@ -3032,7 +3059,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
-		if (inactive_list_is_low(lruvec, false))
+		if (inactive_list_is_low(lruvec, false, sc))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 					   sc, LRU_ACTIVE_ANON);
 
-- 
2.6.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-20 15:21 ` [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned Mel Gorman
@ 2016-07-21  5:16   ` Minchan Kim
  2016-07-21  8:15     ` Mel Gorman
  2016-07-25  8:04   ` Minchan Kim
  1 sibling, 1 reply; 24+ messages in thread
From: Minchan Kim @ 2016-07-21  5:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Wed, Jul 20, 2016 at 04:21:47PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that
> to the LRU sizes. Skipped pages are not considered reclaim candidates but
> contribute to scanned. This can prematurely mark a pgdat as unreclaimable
> and trigger an OOM kill.
> 
> While this does not fix an OOM kill message reported by Joonsoo Kim,
> it did stop pgdat being marked unreclaimable.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 22aec2bcfeec..b16d578ce556 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1415,7 +1415,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	LIST_HEAD(pages_skipped);
>  
>  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> -					!list_empty(src); scan++) {
> +					!list_empty(src);) {
>  		struct page *page;
>  
>  		page = lru_to_page(src);
> @@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>  
> +		/* Pages skipped do not contribute to scan */

The comment should explain why.

/* Pages skipped do not contribute to scan to prevent premature OOM */


> +		scan++;
> +


One of my concerns about node-lru is that it adds more lru lock contention
on multiple-zone systems, so such unbounded skip scanning under the lock
should have a limit to prevent latency spikes and serialization of the
current reclaim work.

Another concern is the big mismatch between the number of pages on the list
and the LRU stat count, because lruvec_lru_size call sites don't read the
stat under the lock while isolate_lru_pages moves many pages from the lru
list to the temporary skipped list.


>  		switch (__isolate_lru_page(page, mode)) {
>  		case 0:
>  			nr_pages = hpage_nr_pages(page);
> -- 
> 2.6.4
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-20 15:21 ` [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate Mel Gorman
@ 2016-07-21  5:30   ` Minchan Kim
  2016-07-21  8:08     ` Mel Gorman
  2016-07-21  7:10   ` Joonsoo Kim
  1 sibling, 1 reply; 24+ messages in thread
From: Minchan Kim @ 2016-07-21  5:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

Hi Mel,

On Wed, Jul 20, 2016 at 04:21:51PM +0100, Mel Gorman wrote:
> From: Minchan Kim <minchan@kernel.org>
> 
> Minchan Kim reported that with per-zone lru state it was possible to
> identify that a normal zone with 86M of anonymous pages could trigger
> OOM with non-atomic order-0 allocations as all pages in the zone
> were in the active list.
> 
>    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
>    Call Trace:
>     [<c51a76e2>] __alloc_pages_nodemask+0xe52/0xe60
>     [<c51f31dc>] ? new_slab+0x39c/0x3b0
>     [<c51f31dc>] new_slab+0x39c/0x3b0
>     [<c51f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
>     [<c563e6fc>] ? __alloc_skb+0x3c/0x260
>     [<c50b8e93>] ? enqueue_task_fair+0x73/0xbf0
>     [<c5219ee0>] ? poll_select_copy_remaining+0x140/0x140
>     [<c5201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
>     [<c563e6fc>] ? __alloc_skb+0x3c/0x260
>     [<c51f525c>] kmem_cache_alloc+0x22c/0x260
>     [<c563e6fc>] ? __alloc_skb+0x3c/0x260
>     [<c563e6fc>] __alloc_skb+0x3c/0x260
>     [<c563eece>] alloc_skb_with_frags+0x4e/0x1a0
>     [<c5638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
>     [<c570b581>] ? wait_for_unix_gc+0x31/0x90
>     [<c57084dd>] unix_stream_sendmsg+0x28d/0x340
>     [<c5634dad>] sock_sendmsg+0x2d/0x40
>     [<c5634e2c>] sock_write_iter+0x6c/0xc0
>     [<c5204a90>] __vfs_write+0xc0/0x120
>     [<c52053ab>] vfs_write+0x9b/0x1a0
>     [<c51cc4a9>] ? __might_fault+0x49/0xa0
>     [<c52062c4>] SyS_write+0x44/0x90
>     [<c50036c6>] do_fast_syscall_32+0xa6/0x1e0
> 
>    Mem-Info:
>    active_anon:101103 inactive_anon:102219 isolated_anon:0
>     active_file:503 inactive_file:544 isolated_file:0
>     unevictable:0 dirty:0 writeback:34 unstable:0
>     slab_reclaimable:6298 slab_unreclaimable:74669
>     mapped:863 shmem:0 pagetables:100998 bounce:0
>     free:23573 free_pcp:1861 free_cma:0
>    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
>    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>    lowmem_reserve[]: 0 809 1965 1965
>    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
>    lowmem_reserve[]: 0 0 9247 9247
>    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
>    lowmem_reserve[]: 0 0 0 0
>    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
>    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
>    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
>    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>    54409 total pagecache pages
>    53215 pages in swap cache
>    Swap cache stats: add 300982, delete 247765, find 157978/226539
>    Free swap  = 3803244kB
>    Total swap = 4192252kB
>    524186 pages RAM
>    295934 pages HighMem/MovableOnly
>    9642 pages reserved
>    0 pages cma reserved
> 
> The problem is due to the active deactivation logic in inactive_list_is_low.
> 
> 	Node 0 active_anon:404412kB inactive_anon:409040kB
> 
> IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due to
> highmem anonymous stat so VM never deactivates normal zone's anonymous pages.
> 
> This patch is a modified version of Minchan's original solution but based
> upon it. The problem with Minchan's patch is that it didn't take memcg
> into account and any low zone with an imbalanced list could force a rotation.

Could you explain why we should consider memcg here?

> 
> In this page, a zone-constrained global reclaim will rotate the list if

          patch,

> the inactive/active ratio of all eligible zones needs to be corrected. It
> is possible that higher zone pages will be initially rotated prematurely
> but this is the safer choice to maintain overall LRU age.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8f5959469079..dddf73f4293c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1976,7 +1976,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>   *    1TB     101        10GB
>   *   10TB     320        32GB
>   */
> -static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
> +static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> +						struct scan_control *sc)
>  {
>  	unsigned long inactive_ratio;
>  	unsigned long inactive;
> @@ -1993,6 +1994,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
>  	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
>  	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
>  
> +	/*
> +	 * For global reclaim on zone-constrained allocations, it is necessary
> +	 * to check if rotations are required for lowmem to be reclaimed. This
> +	 * calculates the inactive/active pages available in eligible zones.
> +	 */
> +	if (global_reclaim(sc)) {
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +		int zid;
> +
> +		for (zid = sc->reclaim_idx; zid < MAX_NR_ZONES; zid++) {
> +			struct zone *zone = &pgdat->node_zones[zid];
> +			unsigned long inactive_zone, active_zone;
> +
> +			if (!populated_zone(zone))
> +				continue;
> +
> +			inactive_zone = zone_page_state(zone,
> +					NR_ZONE_LRU_BASE + (file * LRU_FILE));
> +			active_zone = zone_page_state(zone,
> +					NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
> +
> +			inactive -= min(inactive, inactive_zone);
> +			active -= min(active, active_zone);
> +		}
> +	}
> +
>  	gb = (inactive + active) >> (30 - PAGE_SHIFT);
>  	if (gb)
>  		inactive_ratio = int_sqrt(10 * gb);
> @@ -2006,7 +2033,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>  				 struct lruvec *lruvec, struct scan_control *sc)
>  {
>  	if (is_active_lru(lru)) {
> -		if (inactive_list_is_low(lruvec, is_file_lru(lru)))
> +		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
>  			shrink_active_list(nr_to_scan, lruvec, sc, lru);
>  		return 0;
>  	}
> @@ -2137,7 +2164,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * lruvec even if it has plenty of old anonymous pages unless the
>  	 * system is under heavy pressure.
>  	 */
> -	if (!inactive_list_is_low(lruvec, true) &&
> +	if (!inactive_list_is_low(lruvec, true, sc) &&
>  	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
> @@ -2379,7 +2406,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_list_is_low(lruvec, false))
> +	if (inactive_list_is_low(lruvec, false, sc))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>  
> @@ -3032,7 +3059,7 @@ static void age_active_anon(struct pglist_data *pgdat,
>  	do {
>  		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
>  
> -		if (inactive_list_is_low(lruvec, false))
> +		if (inactive_list_is_low(lruvec, false, sc))
>  			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  					   sc, LRU_ACTIVE_ANON);
>  
> -- 
> 2.6.4
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
                   ` (4 preceding siblings ...)
  2016-07-20 15:21 ` [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate Mel Gorman
@ 2016-07-21  7:07 ` Minchan Kim
  2016-07-21  9:15   ` Mel Gorman
  2016-07-21  7:31 ` Joonsoo Kim
  6 siblings, 1 reply; 24+ messages in thread
From: Minchan Kim @ 2016-07-21  7:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

Hi Mel,

On Wed, Jul 20, 2016 at 04:21:46PM +0100, Mel Gorman wrote:
> Both Joonsoo Kim and Minchan Kim have reported premature OOM kills on
> a 32-bit platform. The common element is a zone-constrained high-order
> allocation failing. Two factors appear to be at fault -- pgdat being

Strictly speaking, my case is an order-0 allocation failing, not a high-order one.
;)

> considered unreclaimable prematurely and insufficient rotation of the
> active list.
> 
> Unfortunately to date I have been unable to reproduce this with a variety
> of stress workloads on a 2G 32-bit KVM instance. It's not clear why as
> the steps are similar to what was described. It means I've been unable to
> determine if this series addresses the problem or not. I'm hoping they can
> test and report back before these are merged to mmotm. What I have checked
> is that a basic parallel DD workload completed successfully on the same
> machine I used for the node-lru performance tests. I'll leave the other
> tests running just in case anything interesting falls out.
> 
> The series is in three basic parts;
> 
> Patch 1 does not account for skipped pages as scanned. This avoids the pgdat
> 	being prematurely marked unreclaimable
> 
> Patches 2-4 add per-zone stats back in. The actual stats patch is different
> 	to Minchan's as the original patch did not account for unevictable
> 	LRU which would corrupt counters. The second two patches remove
> 	approximations based on pgdat statistics. It's effectively a
> 	revert of "mm, vmstat: remove zone and node double accounting by
> 	approximating retries" but different LRU stats are used. This
> 	is better than a full revert or a reworking of the series as
> 	it preserves history of why the zone stats are necessary.
> 
> 	If this works out, we may have to leave the double accounting in
> 	place for now until an alternative cheap solution presents itself.
> 
> Patch 5 rotates inactive/active lists for lowmem allocations. This is also
> 	quite different to Minchan's patch as the original patch did not
> 	account for memcg and would rotate if *any* eligible zone needed
> 	rotation which may rotate excessively. The new patch considers
> 	the ratio for all eligible zones which is more in line with
> 	node-lru in general.
> 

Now I have tested it and confirmed that it works for me from the OOM point
of view. IOW, I no longer see OOM kills. But note that I tested it
without [1/5], which has the problem I mentioned in that thread.

If you want to merge [1/5], please resend an updated version, but
I doubt we need it at this moment.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm: add per-zone lru list stat
  2016-07-20 15:21 ` [PATCH 2/5] mm: add per-zone lru list stat Mel Gorman
@ 2016-07-21  7:10   ` Joonsoo Kim
  2016-07-23  0:45     ` Fengguang Wu
  0 siblings, 1 reply; 24+ messages in thread
From: Joonsoo Kim @ 2016-07-21  7:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Wed, Jul 20, 2016 at 04:21:48PM +0100, Mel Gorman wrote:
> From: Minchan Kim <minchan@kernel.org>
> 
> While running a stress test with hackbench, I frequently got OOM messages,
> which never happened with zone-lru.
> 
> gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
> ..
> ..
>  [<c71a76e2>] __alloc_pages_nodemask+0xe52/0xe60
>  [<c71f31dc>] ? new_slab+0x39c/0x3b0
>  [<c71f31dc>] new_slab+0x39c/0x3b0
>  [<c71f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
>  [<c763e6fc>] ? __alloc_skb+0x3c/0x260
>  [<c777e127>] ? _raw_spin_unlock_irq+0x27/0x60
>  [<c70cebfc>] ? trace_hardirqs_on_caller+0xec/0x1b0
>  [<c70a1506>] ? finish_task_switch+0xa6/0x220
>  [<c7219ee0>] ? poll_select_copy_remaining+0x140/0x140
>  [<c7201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
>  [<c763e6fc>] ? __alloc_skb+0x3c/0x260
>  [<c71f525c>] kmem_cache_alloc+0x22c/0x260
>  [<c763e6fc>] ? __alloc_skb+0x3c/0x260
>  [<c763e6fc>] __alloc_skb+0x3c/0x260
>  [<c763eece>] alloc_skb_with_frags+0x4e/0x1a0
>  [<c7638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
>  [<c770b581>] ? wait_for_unix_gc+0x31/0x90
>  [<c71cfb1d>] ? alloc_set_pte+0x2ad/0x310
>  [<c77084dd>] unix_stream_sendmsg+0x28d/0x340
>  [<c7634dad>] sock_sendmsg+0x2d/0x40
>  [<c7634e2c>] sock_write_iter+0x6c/0xc0
>  [<c7204a90>] __vfs_write+0xc0/0x120
>  [<c72053ab>] vfs_write+0x9b/0x1a0
>  [<c71cc4a9>] ? __might_fault+0x49/0xa0
>  [<c72062c4>] SyS_write+0x44/0x90
>  [<c70036c6>] do_fast_syscall_32+0xa6/0x1e0
>  [<c777ea2c>] sysenter_past_esp+0x45/0x74
> 
> Mem-Info:
> active_anon:104698 inactive_anon:105791 isolated_anon:192
>  active_file:433 inactive_file:283 isolated_file:22
>  unevictable:0 dirty:0 writeback:296 unstable:0
>  slab_reclaimable:6389 slab_unreclaimable:78927
>  mapped:474 shmem:0 pagetables:101426 bounce:0
>  free:10518 free_pcp:334 free_cma:0
> Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
> DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 809 1965 1965
> Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
> lowmem_reserve[]: 0 0 9247 9247
> HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
> Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
> HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 25121 total pagecache pages
> 24160 pages in swap cache
> Swap cache stats: add 86371, delete 62211, find 42865/60187
> Free swap  = 4015560kB
> Total swap = 4192252kB
> 524186 pages RAM
> 295934 pages HighMem/MovableOnly
> 9658 pages reserved
> 0 pages cma reserved
> 
> The order-0 allocation for normal zone failed while there are a lot of
> reclaimable memory(i.e., anonymous memory with free swap). I wanted to
> analyze the problem but it was hard because we removed per-zone lru stat
> so I couldn't know how many of anonymous memory there are in normal/dma zone.
> 
> When we investigate OOM problem, reclaimable memory count is crucial stat
> to find a problem. Without it, it's hard to parse the OOM message so I
> believe we should keep it.
> 
> With per-zone lru stat,
> 
> gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
> Mem-Info:
> active_anon:101103 inactive_anon:102219 isolated_anon:0
>  active_file:503 inactive_file:544 isolated_file:0
>  unevictable:0 dirty:0 writeback:34 unstable:0
>  slab_reclaimable:6298 slab_unreclaimable:74669
>  mapped:863 shmem:0 pagetables:100998 bounce:0
>  free:23573 free_pcp:1861 free_cma:0
> Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
> DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 809 1965 1965
> Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
> lowmem_reserve[]: 0 0 9247 9247
> HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
> Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
> HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> 54409 total pagecache pages
> 53215 pages in swap cache
> Swap cache stats: add 300982, delete 247765, find 157978/226539
> Free swap  = 3803244kB
> Total swap = 4192252kB
> 524186 pages RAM
> 295934 pages HighMem/MovableOnly
> 9642 pages reserved
> 0 pages cma reserved
> 
> With that, we can see normal zone has a 86M reclaimable memory so we can
> know something goes wrong(I will fix the problem in next patch) in reclaim.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mm_inline.h |  2 ++
>  include/linux/mmzone.h    |  6 ++++++
>  mm/page_alloc.c           | 10 ++++++++++
>  mm/vmstat.c               |  5 +++++
>  4 files changed, 23 insertions(+)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index bcc4ed07fa90..9cc130f5feb2 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -45,6 +45,8 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  
>  	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> +	__mod_zone_page_state(&pgdat->node_zones[zid],
> +				NR_ZONE_LRU_BASE + lru, nr_pages);
>  	acct_highmem_file_pages(zid, lru, nr_pages);
>  }

Hello, Mel and Minchan.

The above change is not sufficient to update the zone stats properly.
We should also change update_lru_sizes() to use the proper zid even if
!CONFIG_HIGHMEM. My test setup is 64-bit with a movable zone and, in this
case, the updating is done wrongly.
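
To illustrate the shape of what I mean (a rough sketch only, not the exact
code; the helper name and the nr_zone_taken[] bookkeeping are assumptions
based on the node-lru series), the per-zone adjustment needs to use the real
zone id of the isolated pages unconditionally rather than only under
CONFIG_HIGHMEM:

static __always_inline void update_lru_sizes(struct lruvec *lruvec,
			enum lru_list lru, unsigned long *nr_zone_taken)
{
	int zid;

	/* Walk every zone the isolated pages were taken from */
	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
		if (!nr_zone_taken[zid])
			continue;

		/* Adjusts both the node and the per-zone LRU counters */
		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
	}
}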

Thanks.

>  
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e6aca07cedb7..72625b04e9ba 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -110,6 +110,12 @@ struct zone_padding {
>  enum zone_stat_item {
>  	/* First 128 byte cacheline (assuming 64 bit words) */
>  	NR_FREE_PAGES,
> +	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> +	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
> +	NR_ZONE_ACTIVE_ANON,
> +	NR_ZONE_INACTIVE_FILE,
> +	NR_ZONE_ACTIVE_FILE,
> +	NR_ZONE_UNEVICTABLE,
>  	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
>  	NR_SLAB_RECLAIMABLE,
>  	NR_SLAB_UNRECLAIMABLE,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 830ad49a584a..b44c9a8d879a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4388,6 +4388,11 @@ void show_free_areas(unsigned int filter)
>  			" min:%lukB"
>  			" low:%lukB"
>  			" high:%lukB"
> +			" active_anon:%lukB"
> +			" inactive_anon:%lukB"
> +			" active_file:%lukB"
> +			" inactive_file:%lukB"
> +			" unevictable:%lukB"
>  			" present:%lukB"
>  			" managed:%lukB"
>  			" mlocked:%lukB"
> @@ -4405,6 +4410,11 @@ void show_free_areas(unsigned int filter)
>  			K(min_wmark_pages(zone)),
>  			K(low_wmark_pages(zone)),
>  			K(high_wmark_pages(zone)),
> +			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
> +			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
> +			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
> +			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
> +			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
>  			K(zone->present_pages),
>  			K(zone->managed_pages),
>  			K(zone_page_state(zone, NR_MLOCK)),
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 91ecca96dcae..f10aad81a9a3 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
>  const char * const vmstat_text[] = {
>  	/* enum zone_stat_item countes */
>  	"nr_free_pages",
> +	"nr_inactive_anon",
> +	"nr_active_anon",
> +	"nr_inactive_file",
> +	"nr_active_file",
> +	"nr_unevictable",
>  	"nr_mlock",
>  	"nr_slab_reclaimable",
>  	"nr_slab_unreclaimable",
> -- 
> 2.6.4
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-20 15:21 ` [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate Mel Gorman
  2016-07-21  5:30   ` Minchan Kim
@ 2016-07-21  7:10   ` Joonsoo Kim
  2016-07-21  8:16     ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Joonsoo Kim @ 2016-07-21  7:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Wed, Jul 20, 2016 at 04:21:51PM +0100, Mel Gorman wrote:
> From: Minchan Kim <minchan@kernel.org>
> 
> Minchan Kim reported that with per-zone lru state it was possible to
> identify that a normal zone with 86M anonymous pages could trigger
> OOM with non-atomic order-0 allocations as all pages in the zone
> were in the active list.
> 
>    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
>    Call Trace:
>     [<c51a76e2>] __alloc_pages_nodemask+0xe52/0xe60
>     [<c51f31dc>] ? new_slab+0x39c/0x3b0
>     [<c51f31dc>] new_slab+0x39c/0x3b0
>     [<c51f4eca>] ___slab_alloc.constprop.87+0x6da/0x840
>     [<c563e6fc>] ? __alloc_skb+0x3c/0x260
>     [<c50b8e93>] ? enqueue_task_fair+0x73/0xbf0
>     [<c5219ee0>] ? poll_select_copy_remaining+0x140/0x140
>     [<c5201645>] __slab_alloc.isra.81.constprop.86+0x40/0x6d
>     [<c563e6fc>] ? __alloc_skb+0x3c/0x260
>     [<c51f525c>] kmem_cache_alloc+0x22c/0x260
>     [<c563e6fc>] ? __alloc_skb+0x3c/0x260
>     [<c563e6fc>] __alloc_skb+0x3c/0x260
>     [<c563eece>] alloc_skb_with_frags+0x4e/0x1a0
>     [<c5638d6a>] sock_alloc_send_pskb+0x16a/0x1b0
>     [<c570b581>] ? wait_for_unix_gc+0x31/0x90
>     [<c57084dd>] unix_stream_sendmsg+0x28d/0x340
>     [<c5634dad>] sock_sendmsg+0x2d/0x40
>     [<c5634e2c>] sock_write_iter+0x6c/0xc0
>     [<c5204a90>] __vfs_write+0xc0/0x120
>     [<c52053ab>] vfs_write+0x9b/0x1a0
>     [<c51cc4a9>] ? __might_fault+0x49/0xa0
>     [<c52062c4>] SyS_write+0x44/0x90
>     [<c50036c6>] do_fast_syscall_32+0xa6/0x1e0
> 
>    Mem-Info:
>    active_anon:101103 inactive_anon:102219 isolated_anon:0
>     active_file:503 inactive_file:544 isolated_file:0
>     unevictable:0 dirty:0 writeback:34 unstable:0
>     slab_reclaimable:6298 slab_unreclaimable:74669
>     mapped:863 shmem:0 pagetables:100998 bounce:0
>     free:23573 free_pcp:1861 free_cma:0
>    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
>    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>    lowmem_reserve[]: 0 809 1965 1965
>    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
>    lowmem_reserve[]: 0 0 9247 9247
>    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
>    lowmem_reserve[]: 0 0 0 0
>    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
>    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
>    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
>    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>    54409 total pagecache pages
>    53215 pages in swap cache
>    Swap cache stats: add 300982, delete 247765, find 157978/226539
>    Free swap  = 3803244kB
>    Total swap = 4192252kB
>    524186 pages RAM
>    295934 pages HighMem/MovableOnly
>    9642 pages reserved
>    0 pages cma reserved
> 
> The problem is due to the active deactivation logic in inactive_list_is_low.
> 
> 	Node 0 active_anon:404412kB inactive_anon:409040kB
> 
> IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due to
> highmem anonymous stat so VM never deactivates normal zone's anonymous pages.
> 
> This patch is a modified version of Minchan's original solution but based
> upon it. The problem with Minchan's patch is that it didn't take memcg
> into account and any low zone with an imbalanced list could force a rotation.
> 
> In this page, a zone-constrained global reclaim will rotate the list if
> the inactive/active ratio of all eligible zones needs to be corrected. It
> is possible that higher zone pages will be initially rotated prematurely
> but this is the safer choice to maintain overall LRU age.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8f5959469079..dddf73f4293c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1976,7 +1976,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>   *    1TB     101        10GB
>   *   10TB     320        32GB
>   */
> -static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
> +static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> +						struct scan_control *sc)
>  {
>  	unsigned long inactive_ratio;
>  	unsigned long inactive;
> @@ -1993,6 +1994,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
>  	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
>  	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
>  
> +	/*
> +	 * For global reclaim on zone-constrained allocations, it is necessary
> +	 * to check if rotations are required for lowmem to be reclaimed. This
> +	 * calculates the inactive/active pages available in eligible zones.
> +	 */
> +	if (global_reclaim(sc)) {
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +		int zid;
> +
> +		for (zid = sc->reclaim_idx; zid < MAX_NR_ZONES; zid++) {

Should be changed to "zid = sc->reclaim_idx + 1"

Thanks.

> +			struct zone *zone = &pgdat->node_zones[zid];
> +			unsigned long inactive_zone, active_zone;
> +
> +			if (!populated_zone(zone))
> +				continue;
> +
> +			inactive_zone = zone_page_state(zone,
> +					NR_ZONE_LRU_BASE + (file * LRU_FILE));
> +			active_zone = zone_page_state(zone,
> +					NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
> +
> +			inactive -= min(inactive, inactive_zone);
> +			active -= min(active, active_zone);
> +		}
> +	}
> +
>  	gb = (inactive + active) >> (30 - PAGE_SHIFT);
>  	if (gb)
>  		inactive_ratio = int_sqrt(10 * gb);
> @@ -2006,7 +2033,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>  				 struct lruvec *lruvec, struct scan_control *sc)
>  {
>  	if (is_active_lru(lru)) {
> -		if (inactive_list_is_low(lruvec, is_file_lru(lru)))
> +		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
>  			shrink_active_list(nr_to_scan, lruvec, sc, lru);
>  		return 0;
>  	}
> @@ -2137,7 +2164,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * lruvec even if it has plenty of old anonymous pages unless the
>  	 * system is under heavy pressure.
>  	 */
> -	if (!inactive_list_is_low(lruvec, true) &&
> +	if (!inactive_list_is_low(lruvec, true, sc) &&
>  	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
> @@ -2379,7 +2406,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_list_is_low(lruvec, false))
> +	if (inactive_list_is_low(lruvec, false, sc))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>  
> @@ -3032,7 +3059,7 @@ static void age_active_anon(struct pglist_data *pgdat,
>  	do {
>  		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
>  
> -		if (inactive_list_is_low(lruvec, false))
> +		if (inactive_list_is_low(lruvec, false, sc))
>  			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  					   sc, LRU_ACTIVE_ANON);
>  
> -- 
> 2.6.4
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1
  2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
                   ` (5 preceding siblings ...)
  2016-07-21  7:07 ` [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Minchan Kim
@ 2016-07-21  7:31 ` Joonsoo Kim
  2016-07-21  8:39   ` Minchan Kim
  2016-07-21  9:16   ` Mel Gorman
  6 siblings, 2 replies; 24+ messages in thread
From: Joonsoo Kim @ 2016-07-21  7:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Wed, Jul 20, 2016 at 04:21:46PM +0100, Mel Gorman wrote:
> Both Joonsoo Kim and Minchan Kim have reported premature OOM kills on
> a 32-bit platform. The common element is a zone-constrained high-order
> allocation failing. Two factors appear to be at fault -- pgdat being
> considered unreclaimable prematurely and insufficient rotation of the
> active list.
> 
> Unfortunately to date I have been unable to reproduce this with a variety
> of stress workloads on a 2G 32-bit KVM instance. It's not clear why as
> the steps are similar to what was described. It means I've been unable to
> determine if this series addresses the problem or not. I'm hoping they can
> test and report back before these are merged to mmotm. What I have checked
> is that a basic parallel DD workload completed successfully on the same
> machine I used for the node-lru performance tests. I'll leave the other
> tests running just in case anything interesting falls out.

Hello, Mel.

I tested this series and it doesn't solve my problem. But, with this
series and one change below, my problem is solved.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f5ab357..d451c29 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1819,7 +1819,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 
                nr_pages = hpage_nr_pages(page);
                update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-               list_move(&page->lru, &lruvec->lists[lru]);
+               list_move_tail(&page->lru, &lruvec->lists[lru]);
                pgmoved += nr_pages;
 
                if (put_page_testzero(page)) {

It is a brain-dead work-around, so it would be better for you to find a proper solution.

I guess that, in my test, file references happen very quickly. So, if there are
many skip candidates, reclaimable pages in the lower zone cannot be reclaimed
easily because they keep being re-referenced. With the above work-around applied,
the test finally passes.

One more note: in my test, patch 1/5 has a negative impact. Sometimes the
system locks up, and the elapsed time is also worse than in the test without it.

Anyway, it'd be good to post my test script and program.

setup: 64 bit 2000 MB (500 MB DMA32 and 1500 MB MOVABLE)

sudo swapoff -a
file-read 1500 0 &
file-read 1500 0 &

while :; do
	./fork 3000 0
done
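
Note: file-read mmaps TEST_FILE ("XXXX"), so a file at least as large as the
mapped size needs to exist in the current directory before the readers are
started.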

Thanks.


file-read.c
-----------------
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

#define MB (1024 * 1024)
#define PAGE_SIZE (4096)
#define TEST_FILE "XXXX"

static void touch_mem_seq(void *mem, unsigned long size_mb)
{
        unsigned long i;
        unsigned long size_b;
        char c;

        size_b = size_mb * MB;
        for (i = 0; i < size_b; i += PAGE_SIZE)
                c = *((char *)mem + i);
}

static void touch_mem_rand(void *mem, unsigned long size_mb)
{
        unsigned long i;
        unsigned long size_b;
        char c;

        size_b = size_mb * MB;
        for (i = 0; i < size_b; i += PAGE_SIZE)
                c = *((char *)mem + rand() % size_b);
}

int main(int argc, char *argv[])
{
        unsigned long size_mb;
        void *mem;
        int fd;
        int type;

        srand(time(NULL));

        if (argc != 3) {
                printf("Invalid argument\n");
                exit(1);
        }

        size_mb = atol(argv[1]);
        if (size_mb < 1 || size_mb > 2048) {
                printf("Invalid argument\n");
                exit(1);
        }

        type = atol(argv[2]);
        if (type != 0 && type != 1) {
                printf("Invalid argument\n");
                exit(1);
        }

        fd = open(TEST_FILE, O_RDWR);
        if (fd < 0) {
                printf("Open failed\n");
                exit(1);
        }

        mem = mmap(NULL, size_mb * MB, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mem == MAP_FAILED) {
                printf ("Out of memory: %lu MB\n", size_mb);
                exit(1);
        }

        while (1) {
                if (!type)
                        touch_mem_seq(mem, size_mb);
                else
                        touch_mem_rand(mem, size_mb);
        }

        return 0;
}



fork.c
------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <signal.h>

int main(int argc, char *argv[])
{
        int i, n;
        pid_t pid;
        pid_t *pids;

        if (argc != 2) {
                printf("Invalid argument\n");
                exit(1);
        }

        n = atoi(argv[1]);
        pids = malloc(n * sizeof(pid_t));
        if (!pids) {
                printf("Out of memory\n");
                exit(1);
        }

        for (i = 0; i < n; i++) {
                pid = fork();
                if (pid == 0) {
                        /* child: just sleep until the parent kills it */
                        sleep(1000);
                        exit(0);
                }

                if (pid == -1) {
                        i--;
                        continue;
                }
                pids[i] = pid;
                if (i % 100 == 0)
                        printf("Child forked: %d\n", i);
        }

        for (i = 0; i < n; i++) {
                kill(pids[i], SIGTERM);
        }

        sleep(1);
        printf("Parent finished\n");
}


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-21  5:30   ` Minchan Kim
@ 2016-07-21  8:08     ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-21  8:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 02:30:17PM +0900, Minchan Kim wrote:
> > The problem is due to the active deactivation logic in inactive_list_is_low.
> > 
> > 	Node 0 active_anon:404412kB inactive_anon:409040kB
> > 
> > IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due to
> > highmem anonymous stat so VM never deactivates normal zone's anonymous pages.
> > 
> > This patch is a modified version of Minchan's original solution but based
> > upon it. The problem with Minchan's patch is that it didn't take memcg
> > into account and any low zone with an imbalanced list could force a rotation.
> 
> Could you explan why we should consider memcg here?
> 

It already was and there is no good reason to ignore it if it's memcg
reclaim.

> > In this page, a zone-constrained global reclaim will rotate the list if
> 
>           patch,
> 

I'll fix it.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-21  5:16   ` Minchan Kim
@ 2016-07-21  8:15     ` Mel Gorman
  2016-07-21  8:31       ` Minchan Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2016-07-21  8:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 02:16:48PM +0900, Minchan Kim wrote:
> On Wed, Jul 20, 2016 at 04:21:47PM +0100, Mel Gorman wrote:
> > Page reclaim determines whether a pgdat is unreclaimable by examining how
> > many pages have been scanned since a page was freed and comparing that
> > to the LRU sizes. Skipped pages are not considered reclaim candidates but
> > contribute to scanned. This can prematurely mark a pgdat as unreclaimable
> > and trigger an OOM kill.
> > 
> > While this does not fix an OOM kill message reported by Joonsoo Kim,
> > it did stop pgdat being marked unreclaimable.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  mm/vmscan.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 22aec2bcfeec..b16d578ce556 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1415,7 +1415,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  	LIST_HEAD(pages_skipped);
> >  
> >  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> > -					!list_empty(src); scan++) {
> > +					!list_empty(src);) {
> >  		struct page *page;
> >  
> >  		page = lru_to_page(src);
> > @@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  			continue;
> >  		}
> >  
> > +		/* Pages skipped do not contribute to scan */
> 
> The comment should explain why.
> 
> /* Pages skipped do not contribute to scan to prevent premature OOM */
> 

Specifically, it's to prevent pgdat being considered unreclaimable
prematurely. I'll update the comment.

> 
> > +		scan++;
> > +
> 
> 
> > One of my concerns about node-lru is that it adds more lru lock contention
> > on multiple-zone systems, so such unbounded skip scanning under the lock
> > should have a limit to prevent latency spikes and serialization of the
> > current reclaim work.
> 

The LRU lock already was quite a large lock, particularly on NUMA systems,
with contention raising the more direct reclaimers that are active. It's
worth remembering that the series also shows much lower system CPU time
in some tests. This is the current CPU usage breakdown for a parallel dd test

                 4.7.0-rc4           4.7.0-rc7         4.7.0-rc7
            mmotm-20160623   mm1-followup-v3r1   mm1-oomfix-v4r2
User               1548.01              927.23            777.74
System             8609.71             5540.02           4445.56
Elapsed            3587.10             3598.00           3498.54

The LRU lock is held during skips but it's also doing no real work.

> Another concern is big mismatch between the number of pages from list and
> LRU stat count because lruvec_lru_size call sites don't take the stat
> under the lock while isolate_lru_pages moves many pages from lru list
> to temporal skipped list.
> 

It's already known that the reading of the LRU size can mismatch the
actual size. It's why inactive_list_is_low() in the last patch has
checks like

inactive -= min(inactive, inactive_zone);

It's watching for underflows
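
For example, with made-up numbers, if a racy read returns a per-zone count
that is larger than the node-level count, a plain subtraction would wrap the
unsigned value around instead of clamping at zero:

	unsigned long inactive = 100;		/* node-level count, read racily */
	unsigned long inactive_zone = 120;	/* per-zone count, read racily   */

	/* "inactive -= inactive_zone" would wrap to a huge unsigned value */
	inactive -= min(inactive, inactive_zone);	/* clamps at 0 instead */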

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate
  2016-07-21  7:10   ` Joonsoo Kim
@ 2016-07-21  8:16     ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-21  8:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Thu, Jul 21, 2016 at 04:10:50PM +0900, Joonsoo Kim wrote:
> > @@ -1993,6 +1994,32 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
> >  	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> >  	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> >  
> > +	/*
> > +	 * For global reclaim on zone-constrained allocations, it is necessary
> > +	 * to check if rotations are required for lowmem to be reclaimed. This
> > +	 * calculates the inactive/active pages available in eligible zones.
> > +	 */
> > +	if (global_reclaim(sc)) {
> > +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +		int zid;
> > +
> > +		for (zid = sc->reclaim_idx; zid < MAX_NR_ZONES; zid++) {
> 
> Should be changed to "zid = sc->reclaim_idx + 1"
> 

You're right, well spotted!

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-21  8:15     ` Mel Gorman
@ 2016-07-21  8:31       ` Minchan Kim
  0 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2016-07-21  8:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 09:15:06AM +0100, Mel Gorman wrote:
> On Thu, Jul 21, 2016 at 02:16:48PM +0900, Minchan Kim wrote:
> > On Wed, Jul 20, 2016 at 04:21:47PM +0100, Mel Gorman wrote:
> > > Page reclaim determines whether a pgdat is unreclaimable by examining how
> > > many pages have been scanned since a page was freed and comparing that
> > > to the LRU sizes. Skipped pages are not considered reclaim candidates but
> > > contribute to scanned. This can prematurely mark a pgdat as unreclaimable
> > > and trigger an OOM kill.
> > > 
> > > While this does not fix an OOM kill message reported by Joonsoo Kim,
> > > it did stop pgdat being marked unreclaimable.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > ---
> > >  mm/vmscan.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 22aec2bcfeec..b16d578ce556 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1415,7 +1415,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  	LIST_HEAD(pages_skipped);
> > >  
> > >  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> > > -					!list_empty(src); scan++) {
> > > +					!list_empty(src);) {
> > >  		struct page *page;
> > >  
> > >  		page = lru_to_page(src);
> > > @@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  			continue;
> > >  		}
> > >  
> > > +		/* Pages skipped do not contribute to scan */
> > 
> > The comment should explain why.
> > 
> > /* Pages skipped do not contribute to scan to prevent premature OOM */
> > 
> 
> Specifically, it's to prevent pgdat being considered unreclaimable
> prematurely. I'll update the comment.
> 
> > 
> > > +		scan++;
> > > +
> > 
> > 
> > One of my concerns about node-lru is that it adds more lru lock contention
> > on multiple-zone systems, so such unbounded skip scanning under the lock
> > should have a limit to prevent latency spikes and serialization of the
> > current reclaim work.
> > 
> 
> The LRU lock already was quite a large lock, particularly on NUMA systems,
> with contention raising the more direct reclaimers that are active. It's
> worth remembering that the series also shows much lower system CPU time
> in some tests. This is the current CPU usage breakdown for a parallel dd test
> 
>                  4.7.0-rc4           4.7.0-rc7         4.7.0-rc7
>             mmotm-20160623   mm1-followup-v3r1   mm1-oomfix-v4r2
> User               1548.01              927.23            777.74
> System             8609.71             5540.02           4445.56
> Elapsed            3587.10             3598.00           3498.54
> 
> The LRU lock is held during skips but it's also doing no real work.

If the inactive LRU list is almost full of higher-zone pages, the unbounded
scanning under lru_lock would be a disaster because other reclaimers can get
stuck on the lru_lock.

With [1/5], testing was 100 times slower (to be honest, I had to give up
waiting for the test to finish). That's why I tested this series without [1/5].

> 
> > Another concern is big mismatch between the number of pages from list and
> > LRU stat count because lruvec_lru_size call sites don't take the stat
> > under the lock while isolate_lru_pages moves many pages from lru list
> > to temporal skipped list.
> > 
> 
> It's already known that the reading of the LRU size can mismatch the
> actual size. It's why inactive_list_is_low() in the last patch has
> checks like
> 
> inactive -= min(inactive, inactive_zone);
> 
> It's watching for underflows
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1
  2016-07-21  7:31 ` Joonsoo Kim
@ 2016-07-21  8:39   ` Minchan Kim
  2016-07-21  9:16   ` Mel Gorman
  1 sibling, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2016-07-21  8:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Thu, Jul 21, 2016 at 04:31:56PM +0900, Joonsoo Kim wrote:
> On Wed, Jul 20, 2016 at 04:21:46PM +0100, Mel Gorman wrote:
> > Both Joonsoo Kim and Minchan Kim have reported premature OOM kills on
> > a 32-bit platform. The common element is a zone-constrained high-order
> > allocation failing. Two factors appear to be at fault -- pgdat being
> > considered unreclaimable prematurely and insufficient rotation of the
> > active list.
> > 
> > Unfortunately to date I have been unable to reproduce this with a variety
> > of stress workloads on a 2G 32-bit KVM instance. It's not clear why as
> > the steps are similar to what was described. It means I've been unable to
> > determine if this series addresses the problem or not. I'm hoping they can
> > test and report back before these are merged to mmotm. What I have checked
> > is that a basic parallel DD workload completed successfully on the same
> > machine I used for the node-lru performance tests. I'll leave the other
> > tests running just in case anything interesting falls out.
> 
> Hello, Mel.
> 
> I tested this series and it doesn't solve my problem. But, with this
> series and one change below, my problem is solved.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5ab357..d451c29 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1819,7 +1819,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>  
>                 nr_pages = hpage_nr_pages(page);
>                 update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
> -               list_move(&page->lru, &lruvec->lists[lru]);
> +               list_move_tail(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>  
>                 if (put_page_testzero(page)) {
> 
> It is a brain-dead work-around, so it would be better for you to find a proper solution.

I tested the patch below roughly and it improved performance a lot.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd68a18..9061e5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1809,7 +1809,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 static void move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
-				     enum lru_list lru)
+				     enum lru_list lru,
+				     bool tail)
 {
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long pgmoved = 0;
@@ -1825,7 +1826,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 
 		nr_pages = hpage_nr_pages(page);
 		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
+		if (!tail)
+			list_move(&page->lru, &lruvec->lists[lru]);
+		else
+			list_move_tail(&page->lru, &lruvec->lists[lru]);
 		pgmoved += nr_pages;
 
 		if (put_page_testzero(page)) {
@@ -1847,6 +1851,47 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
+static bool inactive_list_is_extreme_low(struct lruvec *lruvec, bool file,
+						struct scan_control *sc)
+{
+	unsigned long inactive;
+
+	/*
+	 * If we don't have swap space, anonymous page deactivation
+	 * is pointless.
+	 */
+	if (!file && !total_swap_pages)
+		return false;
+
+	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
+
+	/*
+	 * For global reclaim on zone-constrained allocations, it is necessary
+	 * to check if rotations are required for lowmem to be reclaimed. This
+	 * calculates the inactive/active pages available in eligible zones.
+	 */
+	if (global_reclaim(sc)) {
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+		int zid;
+
+		for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
+			struct zone *zone = &pgdat->node_zones[zid];
+			unsigned long inactive_zone;
+
+			if (!populated_zone(zone))
+				continue;
+
+			inactive_zone = zone_page_state(zone,
+					NR_ZONE_LRU_BASE + (file * LRU_FILE));
+
+			inactive -= min(inactive, inactive_zone);
+		}
+	}
+
+
+	return inactive <= (SWAP_CLUSTER_MAX * num_online_cpus());
+}
+
 static void shrink_active_list(unsigned long nr_to_scan,
 			       struct lruvec *lruvec,
 			       struct scan_control *sc,
@@ -1937,9 +1982,11 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 * get_scan_count.
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
+	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru, false);
+	move_active_pages_to_lru(lruvec, &l_inactive,
+		&l_hold, lru - LRU_ACTIVE,
+		inactive_list_is_extreme_low(lruvec, is_file_lru(lru), sc));
 
-	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
-	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&pgdat->lru_lock);
 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1
  2016-07-21  7:07 ` [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Minchan Kim
@ 2016-07-21  9:15   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-21  9:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Thu, Jul 21, 2016 at 04:07:14PM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> On Wed, Jul 20, 2016 at 04:21:46PM +0100, Mel Gorman wrote:
> > Both Joonsoo Kim and Minchan Kim have reported premature OOM kills on
> > a 32-bit platform. The common element is a zone-constrained high-order
> > allocation failing. Two factors appear to be at fault -- pgdat being
> 
> Strictly speaking, my case is order-0 allocation failing, not high-order.
> ;)
> 

I'll update the leader mail.

> > considered unreclaimable prematurely and insufficient rotation of the
> > active list.
> > 
> > Unfortunately to date I have been unable to reproduce this with a variety
> > of stress workloads on a 2G 32-bit KVM instance. It's not clear why as
> > the steps are similar to what was described. It means I've been unable to
> > determine if this series addresses the problem or not. I'm hoping they can
> > test and report back before these are merged to mmotm. What I have checked
> > is that a basic parallel DD workload completed successfully on the same
> > machine I used for the node-lru performance tests. I'll leave the other
> > tests running just in case anything interesting falls out.
> > 
> > The series is in three basic parts;
> > 
> > Patch 1 does not account for skipped pages as scanned. This avoids the pgdat
> > 	being prematurely marked unreclaimable
> > 
> > Patches 2-4 add per-zone stats back in. The actual stats patch is different
> > 	to Minchan's as the original patch did not account for unevictable
> > 	LRU which would corrupt counters. The second two patches remove
> > 	approximations based on pgdat statistics. It's effectively a
> > 	revert of "mm, vmstat: remove zone and node double accounting by
> > 	approximating retries" but different LRU stats are used. This
> > 	is better than a full revert or a reworking of the series as
> > 	it preserves history of why the zone stats are necessary.
> > 
> > 	If this work out, we may have to leave the double accounting in
> > 	place for now until an alternative cheap solution presents itself.
> > 
> > Patch 5 rotates inactive/active lists for lowmem allocations. This is also
> > 	quite different to Minchan's patch as the original patch did not
> > 	account for memcg and would rotate if *any* eligible zone needed
> > 	rotation which may rotate excessively. The new patch considers
> > 	the ratio for all eligible zones which is more in line with
> > 	node-lru in general.
> > 
> 
> Now I have tested it and confirmed that it works for me from the OOM point of
> view. IOW, I no longer see OOM kills. But note that I tested it without [1/5],
> which has the problem I mentioned in that thread.
> 
> If you want to merge [1/5], please resend an updated version, but I doubt we
> need it at this moment.

Currently I'm looking at a version that scales skipped pages as a partial
scan unless the LRU has no eligible pages. I'll put the patch at the end of
the series so that it'll be easier to test in isolation. I'm also trying to
reproduce a case similar to Joonsoo's.
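
For what it's worth, the shape of the idea is roughly the following (a sketch
only, not the actual patch; the total_skipped counter is an assumption about
how skipped pages would be accumulated in isolate_lru_pages()):

	/*
	 * Credit skipped pages as a fraction of a scan so the pgdat is not
	 * marked unreclaimable prematurely, but count them in full if the
	 * LRU turned out to have no eligible pages at all.
	 */
	if (total_skipped) {
		if (list_empty(src))
			scan += total_skipped;
		else
			scan += total_skipped >> 2;
	}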

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1
  2016-07-21  7:31 ` Joonsoo Kim
  2016-07-21  8:39   ` Minchan Kim
@ 2016-07-21  9:16   ` Mel Gorman
  1 sibling, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2016-07-21  9:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Michal Hocko,
	Vlastimil Babka, Linux-MM, LKML

On Thu, Jul 21, 2016 at 04:31:56PM +0900, Joonsoo Kim wrote:
> On Wed, Jul 20, 2016 at 04:21:46PM +0100, Mel Gorman wrote:
> > Both Joonsoo Kim and Minchan Kim have reported premature OOM kills on
> > a 32-bit platform. The common element is a zone-constrained high-order
> > allocation failing. Two factors appear to be at fault -- pgdat being
> > considered unreclaimable prematurely and insufficient rotation of the
> > active list.
> > 
> > Unfortunately to date I have been unable to reproduce this with a variety
> > of stress workloads on a 2G 32-bit KVM instance. It's not clear why as
> > the steps are similar to what was described. It means I've been unable to
> > determine if this series addresses the problem or not. I'm hoping they can
> > test and report back before these are merged to mmotm. What I have checked
> > is that a basic parallel DD workload completed successfully on the same
> > machine I used for the node-lru performance tests. I'll leave the other
> > tests running just in case anything interesting falls out.
> 
> Hello, Mel.
> 
> I tested this series and it doesn't solve my problem. But, with this
> series and one change below, my problem is solved.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f5ab357..d451c29 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1819,7 +1819,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>  
>                 nr_pages = hpage_nr_pages(page);
>                 update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
> -               list_move(&page->lru, &lruvec->lists[lru]);
> +               list_move_tail(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>  
>                 if (put_page_testzero(page)) {
> 
> It is a brain-dead work-around, so it would be better for you to find a proper solution.
> 

This wrecks LRU ordering.

> I guess that, in my test, file references happen very quickly. So, if there are
> many skip candidates, reclaimable pages in the lower zone cannot be reclaimed
> easily because they keep being re-referenced. With the above work-around applied,
> the test finally passes.
> 

I think that scaling skipped pages as a partial scan may address the issue.

> One more note: in my test, patch 1/5 has a negative impact. Sometimes the
> system locks up, and the elapsed time is also worse than in the test without it.
> 
> Anyway, it'd be good to post my test script and program.
> 
> setup: 64 bit 2000 MB (500 MB DMA32 and 1500 MB MOVABLE)
> 

Thanks. I partially replicated this with a 32-bit machine and minor
modifications. It triggered an OOM within 5 minutes. I'll test the revised
series shortly and when/if it's successful I'll post a V2 of the series.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm: add per-zone lru list stat
  2016-07-21  7:10   ` Joonsoo Kim
@ 2016-07-23  0:45     ` Fengguang Wu
  2016-07-23  1:25       ` Minchan Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Fengguang Wu @ 2016-07-23  0:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Michal Hocko,
	Joonsoo Kim, Vlastimil Babka, Linux-MM, LKML, fengguang.wu

Hi Minchan,

We find duplicate /proc/vmstat lines showing up in linux-next, which
look related to this patch.

>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
>>  const char * const vmstat_text[] = {
>>  	/* enum zone_stat_item countes */
>>  	"nr_free_pages",
>> +	"nr_inactive_anon",
>> +	"nr_active_anon",
>> +	"nr_inactive_file",
>> +	"nr_active_file",
>> +	"nr_unevictable",
>>  	"nr_mlock",
>>  	"nr_slab_reclaimable",
>>  	"nr_slab_unreclaimable",

In the below vmstat output, "nr_inactive_anon 2217" is shown twice, as are
the other entries added by the above chunk.

nr_free_pages 831238
nr_inactive_anon 2217
nr_active_anon 4386
nr_inactive_file 117467
nr_active_file 4602
nr_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_slab_reclaimable 8323
nr_slab_unreclaimable 4641
nr_page_table_pages 870
nr_kernel_stack 3776
nr_bounce 0
nr_zspages 0
numa_hit 201105
numa_miss 0
numa_foreign 0
numa_interleave 66970
numa_local 201105
numa_other 0
nr_free_cma 0
nr_inactive_anon 2217
nr_active_anon 4368
nr_inactive_file 117449
nr_active_file 4620
nr_unevictable 0
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_pages 4321
nr_mapped 3469
nr_file_pages 124348
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 2279
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
...

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] mm: add per-zone lru list stat
  2016-07-23  0:45     ` Fengguang Wu
@ 2016-07-23  1:25       ` Minchan Kim
  0 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2016-07-23  1:25 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Michal Hocko,
	Joonsoo Kim, Vlastimil Babka, Linux-MM, LKML

Hi Fengguang,

On Sat, Jul 23, 2016 at 08:45:15AM +0800, Fengguang Wu wrote:
> Hi Minchan,
> 
> We find duplicate /proc/vmstat lines showing up in linux-next, which
> look related to this patch.
> 
> >>--- a/mm/vmstat.c
> >>+++ b/mm/vmstat.c
> >>@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order)
> >> const char * const vmstat_text[] = {
> >> 	/* enum zone_stat_item countes */
> >> 	"nr_free_pages",
> >>+	"nr_inactive_anon",
> >>+	"nr_active_anon",
> >>+	"nr_inactive_file",
> >>+	"nr_active_file",
> >>+	"nr_unevictable",
> >> 	"nr_mlock",
> >> 	"nr_slab_reclaimable",
> >> 	"nr_slab_unreclaimable",
> 
> In the below vmstat output, "nr_inactive_anon 2217" is shown twice, as are
> the other entries added by the above chunk.
> 
> nr_free_pages 831238
> nr_inactive_anon 2217
> nr_active_anon 4386
> nr_inactive_file 117467
> nr_active_file 4602
> nr_unevictable 0
> nr_zone_write_pending 0
> nr_mlock 0
> nr_slab_reclaimable 8323
> nr_slab_unreclaimable 4641
> nr_page_table_pages 870
> nr_kernel_stack 3776
> nr_bounce 0
> nr_zspages 0
> numa_hit 201105
> numa_miss 0
> numa_foreign 0
> numa_interleave 66970
> numa_local 201105
> numa_other 0
> nr_free_cma 0
> nr_inactive_anon 2217
> nr_active_anon 4368
> nr_inactive_file 117449
> nr_active_file 4620
> nr_unevictable 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_pages_scanned 0
> workingset_refault 0
> workingset_activate 0
> workingset_nodereclaim 0
> nr_anon_pages 4321
> nr_mapped 3469
> nr_file_pages 124348
> nr_dirty 0
> nr_writeback 0
> nr_writeback_temp 0
> nr_shmem 2279
> nr_shmem_hugepages 0
> nr_shmem_pmdmapped 0

Thanks for catching that.
We need to decide whether to maintain the LRU stats both per-zone and per-node.

Mel, do you want to keep the per-node LRU stats in addition to the per-zone ones?
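
If we do keep both, one way to at least make the /proc/vmstat output
unambiguous (a sketch of one option rather than a decision; the nr_zone_*
names are only examples) would be to give the per-zone counters distinct
names in vmstat_text:

	/* enum zone_stat_item counters */
	"nr_free_pages",
	"nr_zone_inactive_anon",
	"nr_zone_active_anon",
	"nr_zone_inactive_file",
	"nr_zone_active_file",
	"nr_zone_unevictable",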



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-20 15:21 ` [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned Mel Gorman
  2016-07-21  5:16   ` Minchan Kim
@ 2016-07-25  8:04   ` Minchan Kim
  2016-07-25  9:20     ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Minchan Kim @ 2016-07-25  8:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Wed, Jul 20, 2016 at 04:21:47PM +0100, Mel Gorman wrote:
> Page reclaim determines whether a pgdat is unreclaimable by examining how
> many pages have been scanned since a page was freed and comparing that
> to the LRU sizes. Skipped pages are not considered reclaim candidates but
> contribute to scanned. This can prematurely mark a pgdat as unreclaimable
> and trigger an OOM kill.
> 
> While this does not fix an OOM kill message reported by Joonsoo Kim,
> it did stop pgdat being marked unreclaimable.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 22aec2bcfeec..b16d578ce556 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1415,7 +1415,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	LIST_HEAD(pages_skipped);
>  
>  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> -					!list_empty(src); scan++) {
> +					!list_empty(src);) {
>  		struct page *page;
>  
>  		page = lru_to_page(src);
> @@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>  
> +		/* Pages skipped do not contribute to scan */
> +		scan++;
> +

As I mentioned on the previous version, under the irq-disabled spinlock such
an unbounded operation would make latency spikes worse if there are a lot of
pages to skip.

Shouldn't we take care of that?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-25  8:04   ` Minchan Kim
@ 2016-07-25  9:20     ` Mel Gorman
  2016-07-28  1:38       ` Minchan Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2016-07-25  9:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Mon, Jul 25, 2016 at 05:04:56PM +0900, Minchan Kim wrote:
> > @@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  			continue;
> >  		}
> >  
> > +		/* Pages skipped do not contribute to scan */
> > +		scan++;
> > +
> 
> As I mentioned for the previous version, under an irq-disabled
> spinlock such an unbounded operation would make latency spikes worse
> if there are a lot of pages to skip.
> 
> Shouldn't we take care of that?

It's not unbounded; it's bounded by the size of the LRU list, and it's
not going to be enough to trigger a warning. While the lock hold time
may be undesirable, unlocking it every SWAP_CLUSTER_MAX pages may
increase overall contention. There is also the question of whether
skipped pages should be temporarily put back before unlocking the LRU,
to avoid isolated pages being unavailable for too long. It also cannot
easily just return early without prematurely triggering OOM due to a
lack of progress. I didn't feel the complexity was justified.
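
To make the trade-off concrete, here is a rough user-space sketch of
the isolation loop. SWAP_CLUSTER_MAX, the types and the list handling
are stand-ins rather than the mm/vmscan.c code; the comment marks
where a periodic unlock, plus putback of the skipped pages, would have
to go:

#include <stdbool.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX 32	/* stand-in for the kernel's batch size */

struct page_model {
	bool eligible;		/* e.g. in a zone the allocation can use */
};

/* Returns the number of pages "isolated"; *nr_skipped counts skips. */
static size_t isolate_model(struct page_model *lru, size_t lru_len,
			    size_t nr_to_scan, size_t *nr_skipped)
{
	size_t scanned = 0, taken = 0, skipped = 0;

	for (size_t i = 0; i < lru_len && scanned < nr_to_scan; i++) {
		if (!lru[i].eligible) {
			skipped++;
			/*
			 * Each skip is still work done while the real
			 * code holds the LRU lock with IRQs off.
			 * Bounding the hold time would mean: every
			 * SWAP_CLUSTER_MAX skips, put the skipped
			 * pages back, drop and reacquire the lock --
			 * the extra complexity being weighed above.
			 */
			continue;
		}
		/* Only eligible pages count as scanned after patch 1. */
		scanned++;
		taken++;
	}

	*nr_skipped = skipped;
	return taken;
}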

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned
  2016-07-25  9:20     ` Mel Gorman
@ 2016-07-28  1:38       ` Minchan Kim
  0 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2016-07-28  1:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vlastimil Babka,
	Linux-MM, LKML

On Mon, Jul 25, 2016 at 10:20:14AM +0100, Mel Gorman wrote:
> On Mon, Jul 25, 2016 at 05:04:56PM +0900, Minchan Kim wrote:
> > > @@ -1429,6 +1429,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  			continue;
> > >  		}
> > >  
> > > +		/* Pages skipped do not contribute to scan */
> > > +		scan++;
> > > +
> > 
> > As I mentioned for the previous version, under an irq-disabled
> > spinlock such an unbounded operation would make latency spikes worse
> > if there are a lot of pages to skip.
> > 
> > Shouldn't we take care of that?
> 
> It's not unbounded; it's bounded by the size of the LRU list, and it's
> not going to be enough to trigger a warning. While the lock hold time
> may be undesirable, unlocking it every SWAP_CLUSTER_MAX pages may
> increase overall contention. There is also the question of whether
> skipped pages should be temporarily put back before unlocking the LRU,
> to avoid isolated pages being unavailable for too long. It also cannot
> easily just return early without prematurely triggering OOM due to a
> lack of progress. I didn't feel the complexity was justified.

I measured the lock holding time and it reached a maximum of 96ms
during 360M of scanning with hackbench. It was very easy to reproduce
with node-lru because it has to skip so many pages.

Given that my box is much faster than a typical mobile CPU, it would
take even longer on an embedded system. I think disabling IRQs for
96ms is worth fixing.

Anyway, I'm done with this. I measured the time by hand, so it's up to
you whether you want to fix it now or leave it as it is until someone
reports it with a more realistic workload.
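
A minimal sketch of that sort of by-hand timing, assuming a
placeholder lock and warning threshold; sched_clock(), spin_lock_irq()
and pr_warn() are existing kernel interfaces, but the surrounding
names are not taken from mm/vmscan.c:

#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static DEFINE_SPINLOCK(model_lru_lock);	/* placeholder for the LRU lock */

static void timed_isolation_pass(void)
{
	u64 start, hold_ns;

	spin_lock_irq(&model_lru_lock);
	start = sched_clock();

	/* ... the isolate_lru_pages() walk would run here ... */

	hold_ns = sched_clock() - start;
	spin_unlock_irq(&model_lru_lock);

	/* Flag hold times long enough to matter for IRQ-off latency. */
	if (hold_ns > 10 * NSEC_PER_MSEC)
		pr_warn("LRU lock held for %llu ns\n",
			(unsigned long long)hold_ns);
}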

> 
> -- 
> Mel Gorman
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-07-28  1:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-20 15:21 [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Mel Gorman
2016-07-20 15:21 ` [PATCH 1/5] mm, vmscan: Do not account skipped pages as scanned Mel Gorman
2016-07-21  5:16   ` Minchan Kim
2016-07-21  8:15     ` Mel Gorman
2016-07-21  8:31       ` Minchan Kim
2016-07-25  8:04   ` Minchan Kim
2016-07-25  9:20     ` Mel Gorman
2016-07-28  1:38       ` Minchan Kim
2016-07-20 15:21 ` [PATCH 2/5] mm: add per-zone lru list stat Mel Gorman
2016-07-21  7:10   ` Joonsoo Kim
2016-07-23  0:45     ` Fengguang Wu
2016-07-23  1:25       ` Minchan Kim
2016-07-20 15:21 ` [PATCH 3/5] mm, vmscan: Remove highmem_file_pages Mel Gorman
2016-07-20 15:21 ` [PATCH 4/5] mm: Remove reclaim and compaction retry approximations Mel Gorman
2016-07-20 15:21 ` [PATCH 5/5] mm: consider per-zone inactive ratio to deactivate Mel Gorman
2016-07-21  5:30   ` Minchan Kim
2016-07-21  8:08     ` Mel Gorman
2016-07-21  7:10   ` Joonsoo Kim
2016-07-21  8:16     ` Mel Gorman
2016-07-21  7:07 ` [PATCH 0/5] Candidate fixes for premature OOM kills with node-lru v1 Minchan Kim
2016-07-21  9:15   ` Mel Gorman
2016-07-21  7:31 ` Joonsoo Kim
2016-07-21  8:39   ` Minchan Kim
2016-07-21  9:16   ` Mel Gorman
