* How to make warn_alloc() reliable?
@ 2016-10-18 11:04 ` Tetsuo Handa
  0 siblings, 0 replies; 12+ messages in thread
From: Tetsuo Handa @ 2016-10-18 11:04 UTC (permalink / raw)
  To: mhocko, akpm, hannes, mgorman, dave.hansen; +Cc: linux-mm, linux-kernel

Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for
too long") is a great step toward reducing the possibility of silent
hang-ups caused by memory allocation stalls. For example, below is a
report where a write() request got stuck because it could not invoke
the OOM killer, as it was a GFP_NOFS allocation request.

---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-xfs-loop.txt.xz ----------
[  351.824548] oom_reaper: reaped process 4727 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  362.309509] warn_alloc: 96 callbacks suppressed
(...snipped...)
[  707.833650] a.out: page alloction stalls for 370009ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  707.833653] CPU: 3 PID: 4746 Comm: a.out Tainted: G        W       4.9.0-rc1+ #80
[  707.833653] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  707.833656]  ffffc90006d27950 ffffffff812e9777 ffffffff8197c438 0000000000000001
[  707.833657]  ffffc90006d279d8 ffffffff8112a114 0342004a7fffd720 ffffffff8197c438
[  707.833658]  ffffc90006d27978 ffffffff00000010 ffffc90006d279e8 ffffc90006d27998
[  707.833658] Call Trace:
[  707.833662]  [<ffffffff812e9777>] dump_stack+0x4f/0x68
[  707.833665]  [<ffffffff8112a114>] warn_alloc+0x144/0x160
[  707.833666]  [<ffffffff8112aad6>] __alloc_pages_nodemask+0x936/0xe80
[  707.833670]  [<ffffffff81177f07>] alloc_pages_current+0x87/0x110
[  707.833672]  [<ffffffff8111f33c>] __page_cache_alloc+0xdc/0x120
[  707.833673]  [<ffffffff8111fe58>] pagecache_get_page+0x88/0x2b0
[  707.833675]  [<ffffffff81120f5b>] grab_cache_page_write_begin+0x1b/0x40
[  707.833677]  [<ffffffff812036ab>] iomap_write_begin+0x4b/0x100
[  707.833678]  [<ffffffff81203932>] iomap_write_actor+0xb2/0x190
[  707.833680]  [<ffffffff81285dcb>] ? xfs_trans_commit+0xb/0x10
[  707.833681]  [<ffffffff81203880>] ? iomap_write_end+0x70/0x70
[  707.833682]  [<ffffffff81203f5e>] iomap_apply+0xae/0x130
[  707.833683]  [<ffffffff81204043>] iomap_file_buffered_write+0x63/0xa0
[  707.833684]  [<ffffffff81203880>] ? iomap_write_end+0x70/0x70
[  707.833686]  [<ffffffff8126bd0f>] xfs_file_buffered_aio_write+0xcf/0x1f0
[  707.833689]  [<ffffffff816152a8>] ? _raw_spin_lock_irqsave+0x18/0x40
[  707.833690]  [<ffffffff81615053>] ? _raw_spin_unlock_irqrestore+0x13/0x30
[  707.833692]  [<ffffffff8126beb5>] xfs_file_write_iter+0x85/0x120
[  707.833694]  [<ffffffff811a802d>] __vfs_write+0xdd/0x140
[  707.833695]  [<ffffffff811a8c7d>] vfs_write+0xad/0x1a0
[  707.833697]  [<ffffffff810021f0>] ? syscall_trace_enter+0x1b0/0x240
[  707.833698]  [<ffffffff811aa090>] SyS_write+0x50/0xc0
[  707.833700]  [<ffffffff811d6b78>] ? do_fsync+0x38/0x60
[  707.833701]  [<ffffffff8100241c>] do_syscall_64+0x5c/0x170
[  707.833702]  [<ffffffff81615786>] entry_SYSCALL64_slow_path+0x25/0x25
[  707.833703] Mem-Info:
[  707.833706] active_anon:451061 inactive_anon:2097 isolated_anon:0
[  707.833706]  active_file:13 inactive_file:115 isolated_file:27
[  707.833706]  unevictable:0 dirty:80 writeback:1 unstable:0
[  707.833706]  slab_reclaimable:3291 slab_unreclaimable:21028
[  707.833706]  mapped:416 shmem:2162 pagetables:3734 bounce:0
[  707.833706]  free:13182 free_pcp:125 free_cma:0
[  707.833708] Node 0 active_anon:1804244kB inactive_anon:8388kB active_file:52kB inactive_file:460kB unevictable:0kB isolated(anon):0kB isolated(file):108kB mapped:1664kB dirty:320kB writeback:4kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 1472512kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:1255 all_unreclaimable? yes
[  707.833710] Node 0 DMA free:8192kB min:352kB low:440kB high:528kB active_anon:7656kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  707.833712] lowmem_reserve[]: 0 1963 1963 1963
[  707.833714] Node 0 DMA32 free:44536kB min:44700kB low:55872kB high:67044kB active_anon:1796588kB inactive_anon:8388kB active_file:52kB inactive_file:460kB unevictable:0kB writepending:324kB present:2080640kB managed:2010816kB mlocked:0kB slab_reclaimable:13164kB slab_unreclaimable:84080kB kernel_stack:7312kB pagetables:14912kB bounce:0kB free_pcp:500kB local_pcp:168kB free_cma:0kB
[  707.833715] lowmem_reserve[]: 0 0 0 0
[  707.833720] Node 0 DMA: 4*4kB (M) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 1*1024kB (U) 1*2048kB (M) 1*4096kB (M) = 8192kB
[  707.833725] Node 0 DMA32: 4*4kB (UME) 41*8kB (MEH) 692*16kB (UME) 653*32kB (UME) 135*64kB (UMEH) 16*128kB (UMH) 2*256kB (H) 0*512kB 1*1024kB (H) 0*2048kB 0*4096kB = 44536kB
[  707.833726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  707.833727] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  707.833727] 2317 total pagecache pages
[  707.833728] 0 pages in swap cache
[  707.833729] Swap cache stats: add 0, delete 0, find 0/0
[  707.833729] Free swap  = 0kB
[  707.833729] Total swap = 0kB
[  707.833730] 524157 pages RAM
[  707.833730] 0 pages HighMem/MovableOnly
[  707.833730] 17477 pages reserved
[  707.833730] 0 pages hwpoisoned
---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-xfs-loop.txt.xz ----------

But that commit does not cover all possibilities caused by memory
allocation stalls. For example, without the patch below,

----------
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926..bbd0769 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1554,7 +1554,7 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static long too_many_isolated(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
@@ -1581,7 +1581,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
-	return isolated > inactive;
+	return isolated - inactive;
 }
 
 static noinline_for_stack void
@@ -1697,11 +1697,25 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	unsigned long wait_start = jiffies;
+	unsigned int wait_timeout = 10 * HZ;
+	long last_diff = 0;
+	long diff;
 
 	if (!inactive_reclaimable_pages(lruvec, sc, lru))
 		return 0;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
+	while (unlikely((diff = too_many_isolated(pgdat, file, sc)) > 0)) {
+		if (diff < last_diff) {
+			wait_start = jiffies;
+			wait_timeout = 10 * HZ;
+		} else if (time_after(jiffies, wait_start + wait_timeout)) {
+			warn_alloc(sc->gfp_mask,
+				   "shrink_inactive_list() stalls for %ums",
+				   jiffies_to_msecs(jiffies - wait_start));
+			wait_timeout += 10 * HZ;
+		}
+		last_diff = diff;
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
----------

we cannot report an OOM livelock (shown below) where all __GFP_DIRECT_RECLAIM
allocation requests got stuck at too_many_isolated() from shrink_inactive_list(),
waiting for kswapd, which itself got stuck waiting for a lock.

---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-shrink-loop.txt.xz ----------
[  853.591933] oom_reaper: reaped process 7091 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  888.994101] a.out: shrink_inactive_list() stalls for 10032ms, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[  888.996601] CPU: 2 PID: 7107 Comm: a.out Tainted: G        W       4.9.0-rc1+ #80
[  888.998543] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  889.001075]  ffffc90007ecf788 ffffffff812e9777 ffffffff8197ce18 0000000000000000
[  889.003140]  ffffc90007ecf810 ffffffff8112a114 024201ca00000064 ffffffff8197ce18
[  889.005216]  ffffc90007ecf7b0 ffffc90000000010 ffffc90007ecf820 ffffc90007ecf7d0
[  889.007289] Call Trace:
[  889.008295]  [<ffffffff812e9777>] dump_stack+0x4f/0x68
[  889.009842]  [<ffffffff8112a114>] warn_alloc+0x144/0x160
[  889.011389]  [<ffffffff810a4b40>] ? wake_up_bit+0x30/0x30
[  889.012956]  [<ffffffff81137af3>] shrink_inactive_list+0x593/0x5a0
[  889.014659]  [<ffffffff81138389>] shrink_node_memcg+0x509/0x7b0
[  889.016330]  [<ffffffff811ab200>] ? super_cache_count+0x30/0xd0
[  889.018008]  [<ffffffff8113870c>] shrink_node+0xdc/0x320
[  889.019564]  [<ffffffff81138c56>] do_try_to_free_pages+0xd6/0x330
[  889.021276]  [<ffffffff81138f6b>] try_to_free_pages+0xbb/0xf0
[  889.022937]  [<ffffffff8112a8b6>] __alloc_pages_nodemask+0x716/0xe80
[  889.024684]  [<ffffffff812c2197>] ? blk_finish_plug+0x27/0x40
[  889.026322]  [<ffffffff812efa04>] ? __radix_tree_lookup+0x84/0xf0
[  889.028019]  [<ffffffff81177f07>] alloc_pages_current+0x87/0x110
[  889.029706]  [<ffffffff8111f33c>] __page_cache_alloc+0xdc/0x120
[  889.031392]  [<ffffffff81123233>] filemap_fault+0x333/0x570
[  889.033026]  [<ffffffff8126b519>] xfs_filemap_fault+0x39/0x60
[  889.034668]  [<ffffffff8114f774>] __do_fault+0x74/0x180
[  889.036218]  [<ffffffff811559f2>] handle_mm_fault+0xe82/0x1660
[  889.037878]  [<ffffffff8104da40>] __do_page_fault+0x180/0x550
[  889.039493]  [<ffffffff8104de31>] do_page_fault+0x21/0x70
[  889.040963]  [<ffffffff81002525>] ? do_syscall_64+0x165/0x170
[  889.042526]  [<ffffffff81616db2>] page_fault+0x22/0x30
[  889.044118] a.out: shrink_inactive_list() stalls for 10082ms[  889.044789] Mem-Info:
[  889.044793] active_anon:390112 inactive_anon:3030 isolated_anon:0
[  889.044793]  active_file:63 inactive_file:66 isolated_file:32
[  889.044793]  unevictable:0 dirty:5 writeback:3 unstable:0
[  889.044793]  slab_reclaimable:3306 slab_unreclaimable:17523
[  889.044793]  mapped:1012 shmem:4210 pagetables:2823 bounce:0
[  889.044793]  free:13235 free_pcp:31 free_cma:0
[  889.044796] Node 0 active_anon:1560448kB inactive_anon:12120kB active_file:252kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:4048kB dirty:20kB writeback:12kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 1095680kB anon_thp: 16840kB writeback_tmp:0kB unstable:0kB pages_scanned:995 all_unreclaimable? yes
[  889.044796] Node 0 
[  889.044799] DMA free:7208kB min:404kB low:504kB high:604kB active_anon:8456kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:224kB kernel_stack:16kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]:
[  889.044799]  0 1707 1707 1707
Node 0 
[  889.044803] DMA32 free:45732kB min:44648kB low:55808kB high:66968kB active_anon:1551992kB inactive_anon:12120kB active_file:252kB inactive_file:264kB unevictable:0kB writepending:32kB present:2080640kB managed:1748672kB mlocked:0kB slab_reclaimable:13224kB slab_unreclaimable:69868kB kernel_stack:5584kB pagetables:11292kB bounce:0kB free_pcp:124kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]:
[  889.044803]  0 0 0 0
Node 0 
[  889.044805] DMA: 0*4kB 1*8kB (M) 4*16kB (UM) 5*32kB (UM) 9*64kB (UM) 4*128kB (UM) 3*256kB (U) 4*512kB (UM) 1*1024kB (M) 1*2048kB (E) 0*4096kB = 7208kB
Node 0 
[  889.044811] DMA32: 873*4kB (UME) 1010*8kB (UMEH) 717*16kB (UMEH) 445*32kB (UMEH) 100*64kB (UMH) 8*128kB (UMH) 0*256kB 0*512kB 1*1024kB (H) 0*2048kB 0*4096kB = 45732kB
[  889.044817] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  889.044818] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  889.044818] 4379 total pagecache pages
[  889.044820] 0 pages in swap cache
[  889.044820] Swap cache stats: add 0, delete 0, find 0/0
[  889.044821] Free swap  = 0kB
[  889.044821] Total swap = 0kB
[  889.044821] 524157 pages RAM
[  889.044821] 0 pages HighMem/MovableOnly
[  889.044822] 83013 pages reserved
[  889.044822] 0 pages hwpoisoned
(...snipped...)
[  939.150914] INFO: task kswapd0:60 blocked for more than 60 seconds.
[  939.152922]       Tainted: G        W       4.9.0-rc1+ #80
[  939.154891] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  939.157296] kswapd0         D ffffffff816111f7     0    60      2 0x00000000
[  939.159659]  ffff88007a8c9438 ffff880077ff34c0 ffff88007ac93c40 ffff88007a8c8f40
[  939.162131]  ffff88007f816d80 ffffc9000053b780 ffffffff816111f7 000000009b1d7cf9
[  939.164582]  ffff88007a8c8f40 ffff8800776fda18 ffffc9000053b7b0 ffff8800776fda30
[  939.167053] Call Trace:
[  939.168450]  [<ffffffff816111f7>] ? __schedule+0x177/0x550
[  939.170417]  [<ffffffff8161160b>] schedule+0x3b/0x90
[  939.172285]  [<ffffffff81614064>] rwsem_down_read_failed+0xf4/0x160
[  939.174411]  [<ffffffff812bf7ec>] ? get_request+0x43c/0x770
[  939.176429]  [<ffffffff812f6818>] call_rwsem_down_read_failed+0x18/0x30
[  939.178615]  [<ffffffff816133c2>] down_read+0x12/0x30
[  939.180544]  [<ffffffff81277dae>] xfs_ilock+0x3e/0xa0
[  939.182427]  [<ffffffff81261a70>] xfs_map_blocks+0x80/0x180
[  939.184415]  [<ffffffff81262bd8>] xfs_do_writepage+0x1c8/0x710
[  939.186458]  [<ffffffff81261ec9>] ? xfs_setfilesize_trans_alloc.isra.31+0x39/0x90
[  939.189711]  [<ffffffff81263156>] xfs_vm_writepage+0x36/0x70
[  939.192094]  [<ffffffff81134a47>] pageout.isra.42+0x1a7/0x2b0
[  939.194191]  [<ffffffff81136b47>] shrink_page_list+0x7c7/0xb70
[  939.196254]  [<ffffffff81137798>] shrink_inactive_list+0x238/0x5a0
[  939.198541]  [<ffffffff81138389>] shrink_node_memcg+0x509/0x7b0
[  939.200611]  [<ffffffff8113870c>] shrink_node+0xdc/0x320
[  939.202522]  [<ffffffff8113949a>] kswapd+0x2ca/0x620
[  939.204271]  [<ffffffff811391d0>] ? mem_cgroup_shrink_node+0xb0/0xb0
[  939.206243]  [<ffffffff81081234>] kthread+0xd4/0xf0
[  939.207904]  [<ffffffff81081160>] ? kthread_park+0x60/0x60
[  939.209848]  [<ffffffff81615922>] ret_from_fork+0x22/0x30
---------- From http://I-love.SAKURA.ne.jp/tmp/serial-20161017-shrink-loop.txt.xz ----------

This means that, unless we scatter warn_alloc() calls to every location
which might depend on somebody else to make forward progress, we may
fail to get a clue.

The code will look messy if we scatter warn_alloc() calls around.
Also, multiple concurrent warn_alloc() calls become more likely to race with one another.
Even if we guard warn_alloc() with mutex_lock(&oom_lock)/mutex_unlock(&oom_lock),
messages from the hung task watchdog and messages from warn_alloc() calls will still interleave.

There is an alternative approach I proposed at
http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
which serializes both the hung task watchdog and warn_alloc().

So, how can we make warn_alloc() reliable?


* Re: How to make warn_alloc() reliable?
  2016-10-18 11:04 ` Tetsuo Handa
@ 2016-10-18 12:27   ` Michal Hocko
  -1 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2016-10-18 12:27 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, mgorman, dave.hansen, linux-mm, linux-kernel

On Tue 18-10-16 20:04:20, Tetsuo Handa wrote:
[...]
> @@ -1697,11 +1697,25 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	unsigned long wait_start = jiffies;
> +	unsigned int wait_timeout = 10 * HZ;
> +	long last_diff = 0;
> +	long diff;
>  
>  	if (!inactive_reclaimable_pages(lruvec, sc, lru))
>  		return 0;
>  
> -	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> +	while (unlikely((diff = too_many_isolated(pgdat, file, sc)) > 0)) {
> +		if (diff < last_diff) {
> +			wait_start = jiffies;
> +			wait_timeout = 10 * HZ;
> +		} else if (time_after(jiffies, wait_start + wait_timeout)) {
> +			warn_alloc(sc->gfp_mask,
> +				   "shrink_inactive_list() stalls for %ums",
> +				   jiffies_to_msecs(jiffies - wait_start));
> +			wait_timeout += 10 * HZ;
> +		}
> +		last_diff = diff;
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/* We are about to die and free our memory. Return now. */
> ----------
[...]
> So, how can we make warn_alloc() reliable?

This is not about warn_alloc reliability but more about
too_many_isolated waiting for an unbounded amount of time. And that
should be fixed. I do not have a good idea how right now.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread


* Re: How to make warn_alloc() reliable?
  2016-10-18 12:27   ` Michal Hocko
@ 2016-10-19 11:27     ` Tetsuo Handa
  -1 siblings, 0 replies; 12+ messages in thread
From: Tetsuo Handa @ 2016-10-19 11:27 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, mgorman, dave.hansen, linux-mm, linux-kernel

Michal Hocko wrote:
> This is not about warn_alloc reliability but more about
> too_many_isolated waiting for an unbounded amount of time. And that
> should be fixed. I do not have a good idea how right now.

I'm not talking about only the too_many_isolated() case. If I were talking about
this specific case, I would have proposed leaving this loop via a timeout.
For example, where is the guarantee that the current thread never gets stuck
at shrink_inactive_list() after leaving this too_many_isolated() loop?

I think that an ordinary Linux user's perception of memory management is
"Linux reclaims memory when needed, so it is normal that the MemFree:
field of /proc/meminfo is small" and "Linux invokes the OOM killer if
a memory allocation request can't make forward progress". However, we know
that "Linux may not be able to invoke the OOM killer even if a memory
allocation request can't make forward progress". You suddenly bring up (or
admit to) implications/limitations/problems most Linux users do not know
about. That's painful for me, who went to a lot of trouble to get some clue
at a support center.

When we were talking off-list about CVE-2016-2847, your response had been
"Your machine is DoSed already" until we noticed the "too small to fail"
memory-allocation rule. If I had not kept examining the problem until I made
you angry, we would not have arrived at the correct answer. I don't like your
optimistic "Fix it if you can trigger it." approach, which will never give
users (and troubleshooting staff at support centers) a proof. I want an
"expose what Michal Hocko is not aware of or does not care about" mechanism.

What I'm talking about is "why don't you stop playing whack-a-mole games
with missing warn_alloc() calls". I don't blame you for not having a good
idea, but I do blame you for not having a reliable warn_alloc() mechanism.

^ permalink raw reply	[flat|nested] 12+ messages in thread


* Re: How to make warn_alloc() reliable?
  2016-10-19 11:27     ` Tetsuo Handa
@ 2016-10-19 11:55       ` Michal Hocko
  -1 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2016-10-19 11:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, mgorman, dave.hansen, linux-mm, linux-kernel

On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
[...]
> What I'm talking about is "why don't you stop playing whack-a-mole games
> with missing warn_alloc() calls". I don't blame you for not having a good
> idea, but I blame you for not having a reliable warn_alloc() mechanism.

Look, it seems pretty clear that our priorities and views are quite
different. While I believe that we should solve real issues in a
reliable and robust way, you seem to want to have as much reporting as
possible. I do agree that reporting is an important part of debugging
problems, but as your previous attempts at the allocation watchdog show,
proper and bulletproof reporting requires state tracking and is in
general too complex for something that doesn't happen on most properly
configured systems. Maybe there are other ways, but my time is better
spent on something more useful - like making the direct reclaim path
more deterministic, without any unbound loops.

So let's agree to disagree about the importance of warn_alloc
reliability. I see it as an improvement which doesn't really have to be
perfect.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread


* Re: How to make warn_alloc() reliable?
  2016-10-19 11:55       ` Michal Hocko
@ 2016-10-20 12:07         ` Tetsuo Handa
  -1 siblings, 0 replies; 12+ messages in thread
From: Tetsuo Handa @ 2016-10-20 12:07 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, hannes, mgorman, dave.hansen, linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
> [...]
> > What I'm talking about is "why don't you stop playing whack-a-mole games
> > with missing warn_alloc() calls". I don't blame you for not having a good
> > idea, but I blame you for not having a reliable warn_alloc() mechanism.
> 
> Look, it seems pretty clear that our priorities and viewes are quite
> different. While I believe that we should solve real issues in a
> reliable and robust way you seem to love to be have as much reporting as
> possible. I do agree that reporting is important part of debugging of
> problems but as your previous attempts for the allocation watchdog show
> a proper and bullet proof reporting requires state tracking and is in
> general too complex for something that doesn't happen in most properly
> configured systems. Maybe there are other ways but my time is better
> spent on something more useful - like making the direct reclaim path
> more deterministic without any unbound loops.

Properly configured systems should not be bothered by low-memory situations,
but there are systems which are. It is pointless to refer to "properly
configured systems" as a reason not to add a watchdog; it is administrators
who decide whether to use one.

> 
> So let's agree to disagree about importance of the reliability
> warn_alloc. I see it as an improvement which doesn't really have to be
> perfect.

I don't expect kmallocwd alone to be perfect. I expect kmallocwd to serve
as a hook. For example, it will be possible to turn on collecting perf data
when kmallocwd finds a stalling thread and turn it off when kmallocwd finds
none. Since the necessary information is stored in the task struct, it will
be easy to include it in the perf data. Likewise, it will be easy to
extract it using a script for /usr/bin/crash when an administrator has
captured a vmcore image of a stalling KVM guest. Sending vmcore images
to support centers is difficult for file-size and security reasons.
It would be nice if we could get a clue just by reading the task list.

But warn_alloc() can't serve as a hook. I see kmallocwd as an improvement
which doesn't really have to be perfect.



By the way, regarding the "making the direct reclaim path more deterministic"
part, I wish that we could

  (1) introduce phased watermarks which vary based on the stage of the
      reclaim operation (e.g. watermark_lower()/watermark_higher(), which
      resemble preempt_disable()/preempt_enable() but are propagated to
      other threads when delegating operations needed for reclaim to them).

  (2) introduce dedicated kernel threads which perform only a specific
      reclaim operation, using the watermark propagated from other threads
      which perform different reclaim operations.

  (3) remove direct reclaim, which bothers callers with managing the correct
      GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
      ___GFP_DIRECT_RECLAIM callers could simply block in
      wait_event(get_pages_from_freelist() succeeds) rather than polling
      with complicated short sleeps. This would significantly save CPU
      resources (especially while oom_lock is held) which are currently
      wasted by multiple concurrent direct reclaimers.

^ permalink raw reply	[flat|nested] 12+ messages in thread


* Re: How to make warn_alloc() reliable?
  2016-10-20 12:07         ` Tetsuo Handa
@ 2016-10-20 19:30           ` Michal Hocko
  -1 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2016-10-20 19:30 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, hannes, mgorman, dave.hansen, linux-mm, linux-kernel

On Thu 20-10-16 21:07:49, Tetsuo Handa wrote:
[...]
> By the way, regarding "making the direct reclaim path more deterministic"
> part, I wish that we can
> 
>   (1) introduce phased watermarks which varies based on stage of reclaim
>       operation (e.g. watermark_lower()/watermark_higher() which resembles
>       preempt_disable()/preempt_enable() but is propagated to other threads
>       when delegating operations needed for reclaim to other threads).
> 
>   (2) introduce dedicated kernel threads which perform only specific
>       reclaim operation, using watermark propagated from other threads
>       which performs different reclaim operation.
> 
>   (3) remove direct reclaim which bothers callers with managing correct
>       GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
>       ___GFP_DIRECT_RECLAIM callers can simply wait for
>       wait_event(get_pages_from_freelist() succeeds) rather than polling
>       with complicated short sleep. This will significantly save CPU
>       resource (especially when oom_lock is held) which is wasted by
>       activities by multiple concurrent direct reclaim.

As always, you are free to come up with patches with proper
justification and convince people that those steps will help both the
regular cases as well as those you are bothered with.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread


end of thread, other threads:[~2016-10-20 19:30 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-18 11:04 How to make warn_alloc() reliable? Tetsuo Handa
2016-10-18 12:27 ` Michal Hocko
2016-10-19 11:27   ` Tetsuo Handa
2016-10-19 11:55     ` Michal Hocko
2016-10-20 12:07       ` Tetsuo Handa
2016-10-20 19:30         ` Michal Hocko
