* [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
@ 2015-12-29 13:58 ` Tetsuo Handa
  0 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2015-12-29 13:58 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

From 8bb9e36891a803e82c589ef78077838026ce0f7d Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 29 Dec 2015 22:20:58 +0900
Subject: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.

The OOM reaper kernel thread can reclaim an OOM victim's memory before the
victim terminates. But since oom_kill_process() tries to kill a child of the
memory hog process first, the OOM reaper cannot reclaim enough memory for the
victim to terminate if that victim is consuming little memory. The result is
the usual OOM livelock, because timeout-based selection of the next OOM victim
is not implemented.

While SysRq-f (manual invocation of the OOM killer) can wake up the OOM killer,
the OOM killer chooses the same OOM victim, which already has TIF_MEMDIE set.
This effectively disables SysRq-f.

This patch excludes TIF_MEMDIE processes from the candidates so that the memory
hog process itself is killed when all of its children are stuck with TIF_MEMDIE
set.

----------
[  120.078776] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[  120.088610] oom-write cpuset=/ mems_allowed=0
[  120.095558] CPU: 0 PID: 9546 Comm: oom-write Not tainted 4.4.0-rc6-next-20151223 #260
(...snipped...)
[  120.194148] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  120.260191] [ 9546]  1000  9546   541716   453473     896       6        0             0 oom-write
[  120.262166] [ 9547]  1000  9547       40        1       3       2        0             0 write
[  120.264071] [ 9548]  1000  9548       40        1       3       2        0             0 write
[  120.265939] [ 9549]  1000  9549       40        1       4       2        0             0 write
[  120.267794] [ 9550]  1000  9550       40        1       3       2        0             0 write
[  120.269654] [ 9551]  1000  9551       40        1       3       2        0             0 write
[  120.271447] [ 9552]  1000  9552       40        1       3       2        0             0 write
[  120.273220] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  120.274975] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  120.276745] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  120.278516] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  120.280227] Out of memory: Kill process 9546 (oom-write) score 892 or sacrifice child
[  120.282010] Killed process 9549 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  122.506001] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[  122.515041] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[  122.697515] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  122.699492] [ 9551]  1000  9551       40        1       3       2        0             0 write
[  122.701399] [ 9552]  1000  9552       40        1       3       2        0             0 write
[  122.703282] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  122.705188] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  122.707017] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  122.708842] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  122.710675] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  122.712475] Killed process 9551 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  139.606508] sysrq: SysRq : Manual OOM execution
[  139.612371] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  139.620210] kworker/0:2 cpuset=/ mems_allowed=0
(...snipped...)
[  139.795759] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  139.797649] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  139.799526] [ 9552]  1000  9552       40        1       3       2        0             0 write
[  139.801368] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  139.803249] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  139.805020] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  139.806799] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  139.808524] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  139.810216] Killed process 9552 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  142.571815] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  142.573840] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  142.575754] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  142.577633] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  142.579433] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  142.581250] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  142.583003] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  142.585055] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  142.586796] Killed process 9553 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  143.599058] sysrq: SysRq : Manual OOM execution
[  143.604300] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  143.783739] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  143.785691] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  143.787532] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  143.789377] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  143.791172] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  143.792985] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  143.794730] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  143.796723] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  143.798338] Killed process 9554 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  144.374525] sysrq: SysRq : Manual OOM execution
[  144.379779] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  144.560718] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  144.562657] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  144.564560] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  144.566369] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  144.568246] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  144.570001] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  144.571794] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  144.573502] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  144.575119] Killed process 9555 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  145.158485] sysrq: SysRq : Manual OOM execution
[  145.163600] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  145.346059] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  145.348012] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  145.349954] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  145.351817] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  145.353701] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  145.355568] [ 9555]  1000  9555       40        0       3       2        0             0 write
[  145.357319] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  145.359114] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  145.360733] Killed process 9556 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  169.158408] sysrq: SysRq : Manual OOM execution
[  169.163612] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  169.343115] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  169.345053] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  169.346884] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  169.348965] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  169.350893] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  169.352713] [ 9555]  1000  9555       40        0       3       2        0             0 write
[  169.354551] [ 9556]  1000  9556       40        0       3       2        0             0 write
[  169.356450] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  169.358105] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  178.950315] sysrq: SysRq : Manual OOM execution
[  178.955560] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  179.140752] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  179.142653] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  179.144997] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  179.146849] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  179.148654] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  179.150411] [ 9555]  1000  9555       40        0       3       2        0             0 write
[  179.152291] [ 9556]  1000  9556       40        0       3       2        0             0 write
[  179.154002] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  179.155666] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
----------

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8..a1a0f39 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -111,6 +111,18 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 
 	rcu_read_lock();
 
+	/*
+	 * Treat the whole process p as unkillable when one of subthreads has
+	 * TIF_MEMDIE pending. Otherwise, we may end up setting TIF_MEMDIE on
+	 * the same victim forever (e.g. making SysRq-f unusable).
+	 */
+	for_each_thread(p, t) {
+		if (likely(!test_tsk_thread_flag(t, TIF_MEMDIE)))
+			continue;
+		t = NULL;
+		goto found;
+	}
+
 	for_each_thread(p, t) {
 		task_lock(t);
 		if (likely(t->mm))
-- 
1.8.3.1


* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2015-12-29 13:58 ` Tetsuo Handa
@ 2016-01-07  9:15   ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-07  9:15 UTC (permalink / raw)
  To: Tetsuo Handa, rientjes
  Cc: akpm, mgorman, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

On Tue 29-12-15 22:58:22, Tetsuo Handa wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4b0a5d8..a1a0f39 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -111,6 +111,18 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>  
>  	rcu_read_lock();
>  
> +	/*
> +	 * Treat the whole process p as unkillable when one of subthreads has
> +	 * TIF_MEMDIE pending. Otherwise, we may end up setting TIF_MEMDIE on
> +	 * the same victim forever (e.g. making SysRq-f unusable).
> +	 */
> +	for_each_thread(p, t) {
> +		if (likely(!test_tsk_thread_flag(t, TIF_MEMDIE)))
> +			continue;
> +		t = NULL;
> +		goto found;
> +	}
> +

I do not think the placement in find_lock_task_mm is desirable nor
correct. This function is used in multiple contexts outside of the oom
proper. It only returns a locked task_struct for a thread that belongs
to the process.

>  	for_each_thread(p, t) {
>  		task_lock(t);
>  		if (likely(t->mm))

What you are seeing is clearly undesirable, of course, but I believe we
should handle it at the oom_kill_process() layer. Blindly selecting a child
process even when it doesn't sit on much memory, or when it has already
been killed, is wrong. The heuristic is clearly too naive, so we should
fix it rather than compensate for it somewhere else. What about the
following simple approach? It does two things and I will split it up if
this looks like a desirable approach. Please note I haven't tested it
because it is more of an idea than a finished thing. What do you think?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e4af31db96f..a7c965777001 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -638,6 +638,73 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+
+/*
+ * If any of victim's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent.  This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+static struct task_struct *
+try_to_sacrifice_child(struct oom_control *oc, struct task_struct *victim,
+		       unsigned long totalpages, struct mem_cgroup *memcg)
+{
+	struct task_struct *child_victim = NULL;
+	unsigned int victim_points = 0;
+	struct task_struct *t;
+
+	read_lock(&tasklist_lock);
+	for_each_thread(victim, t) {
+		struct task_struct *child;
+
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			/*
+			 * Skip over children which have already been OOM
+			 * killed, as that obviously has not helped to resolve
+			 * the situation. oom_scan_process_thread() would abort
+			 * the scan when seeing them, but that is not the case
+			 * here, so we must be doing a forced OOM kill and we
+			 * do not want to loop over the same tasks again.
+			 */
+			if (test_tsk_thread_flag(child, TIF_MEMDIE))
+				continue;
+
+			if (process_shares_mm(child, victim->mm))
+				continue;
+
+			child_points = oom_badness(child, memcg, oc->nodemask,
+								totalpages);
+			if (child_points > victim_points) {
+				if (child_victim)
+					put_task_struct(child_victim);
+				child_victim = child;
+				victim_points = child_points;
+				get_task_struct(child_victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	if (!child_victim)
+		goto out;
+
+	/*
+	 * Protecting the parent makes sense only if killing the child
+	 * would release at least some memory (at least 1MB).
+	 */
+	if (K(victim_points) >= 1024) {
+		put_task_struct(victim);
+		victim = child_victim;
+	} else {
+		put_task_struct(child_victim);
+	}
+
+out:
+	return victim;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -647,10 +714,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		      struct mem_cgroup *memcg, const char *message)
 {
 	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
@@ -674,34 +738,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, memcg, oc->nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
+	victim = try_to_sacrifice_child(oc, victim, totalpages, memcg);
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07  9:15   ` Michal Hocko
@ 2016-01-07 13:31     ` Tetsuo Handa
  0 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-07 13:31 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: akpm, mgorman, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

Michal Hocko wrote:
> I do not think the placement in find_lock_task_mm is desirable nor
> correct. This function is used in multiple contexts outside of the oom
> proper. It only returns a locked task_struct for a thread that belongs
> to the process.

OK. Andrew, please drop this patch from the -mm tree for now.

> What you are seeing is clearly undesirable, of course, but I believe we
> should handle it at the oom_kill_process() layer. Blindly selecting a child
> process even when it doesn't sit on much memory, or when it has already
> been killed, is wrong. The heuristic is clearly too naive, so we should
> fix it rather than compensate for it somewhere else. What about the
> following simple approach? It does two things and I will split it up if
> this looks like a desirable approach. Please note I haven't tested it
> because it is more of an idea than a finished thing. What do you think?

I think we need to filter at select_bad_process() and oom_kill_process().

When P has no children, P is chosen and TIF_MEMDIE is set on P. But P can
be chosen forever due to P->signal->oom_score_adj == OOM_SCORE_ADJ_MAX
even if the OOM reaper reclaimed P's mm. We need to ensure that
oom_kill_process() is not called with P if P already has TIF_MEMDIE.
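
For illustration only (not code from any patch posted in this thread): the
kind of helper such filtering would need, reporting whether any thread of a
process already has TIF_MEMDIE set so that select_bad_process() or
oom_kill_process() could skip that process. The helper name is made up for
this sketch.

/* Illustrative sketch only; the helper name is hypothetical. */
static bool process_has_memdie_thread(struct task_struct *p)
{
	struct task_struct *t;
	bool ret = false;

	rcu_read_lock();
	for_each_thread(p, t) {
		if (test_tsk_thread_flag(t, TIF_MEMDIE)) {
			ret = true;
			break;
		}
	}
	rcu_read_unlock();
	return ret;
}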

(By the way, we are already assuming the OOM reaper kernel thread is
available. Changing to BUG_ON(IS_ERR(oom_reaper_th)) should be OK. ;-) )


* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07 13:31     ` Tetsuo Handa
@ 2016-01-07 14:58       ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-07 14:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 07-01-16 22:31:32, Tetsuo Handa wrote:
[...]
> I think we need to filter at select_bad_process() and oom_kill_process().
> 
> When P has no children, P is chosen and TIF_MEMDIE is set on P. But P can
> be chosen forever due to P->signal->oom_score_adj == OOM_SCORE_ADJ_MAX
> even if the OOM reaper reclaimed P's mm. We need to ensure that
> oom_kill_process() is not called with P if P already has TIF_MEMDIE.

Hmm. Any task is allowed to set its oom_score_adj that way and I
guess we should really make sure that at least sysrq+f will make some
progress. This is what I would do. Again I think this is worth a
separate patch. Unless there are any objections I will roll out what I
have and post 3 separate patches.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 45e51ad2f7cf..ee34a51bd65a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 		if (points == chosen_points && thread_group_leader(chosen))
 			continue;
 
+		/*
+		 * If the currently evaluated task is already oom killed and
+		 * this is a sysrq+f request then we would rather choose
+		 * somebody else because the current oom victim might be stuck.
+		 */
+		if (is_sysrq_oom(oc) && test_tsk_thread_flag(p, TIF_MEMDIE))
+			continue;
+
 		chosen = p;
 		chosen_points = points;
 	}
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07 14:58       ` Michal Hocko
@ 2016-01-07 15:38         ` Tetsuo Handa
  0 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-07 15:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
>  		if (points == chosen_points && thread_group_leader(chosen))
>  			continue;
>  
> +		/*
> +		 * If the currently evaluated task is already oom killed and
> +		 * this is a sysrq+f request then we would rather choose
> +		 * somebody else because the current oom victim might be stuck.
> +		 */
> +		if (is_sysrq_oom(oc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> +			continue;
> +
>  		chosen = p;
>  		chosen_points = points;
>  	}

Do we want to require a SysRq-f for each thread in a process?
If g has 1024 threads p, will dump_tasks() end up doing

  pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",

1024 times? I think one SysRq-f per process is sufficient.

How can we guarantee that find_lock_task_mm() called from oom_kill_process()
chooses a !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose a
!TIF_MEMDIE thread? I think choosing a !TIF_MEMDIE thread at
find_lock_task_mm() is the simplest way.
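
For illustration only (not from a posted patch): the kind of
find_lock_task_mm() variant being suggested here, mirroring the existing
function but skipping threads which already have TIF_MEMDIE; the function
name is made up for this sketch.

/* Illustrative sketch only; the name is hypothetical. */
static struct task_struct *find_lock_task_mm_skip_memdie(struct task_struct *p)
{
	struct task_struct *t;

	rcu_read_lock();
	for_each_thread(p, t) {
		/* Skip threads which are already marked as OOM victims. */
		if (test_tsk_thread_flag(t, TIF_MEMDIE))
			continue;
		task_lock(t);
		if (likely(t->mm))
			goto found;
		task_unlock(t);
	}
	t = NULL;
found:
	rcu_read_unlock();
	return t;
}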


* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07 14:58       ` Michal Hocko
@ 2016-01-07 15:44         ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-07 15:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 07-01-16 15:58:41, Michal Hocko wrote:
> On Thu 07-01-16 22:31:32, Tetsuo Handa wrote:
> [...]
> > I think we need to filter at select_bad_process() and oom_kill_process().
> > 
> > When P has no children, P is chosen and TIF_MEMDIE is set on P. But P can
> > be chosen forever due to P->signal->oom_score_adj == OOM_SCORE_ADJ_MAX
> > even if the OOM reaper reclaimed P's mm. We need to ensure that
> > oom_kill_process() is not called with P if P already has TIF_MEMDIE.
> 
> Hmm. Any task is allowed to set its oom_score_adj that way and I
> guess we should really make sure that at least sysrq+f will make some
> progress. This is what I would do. Again I think this is worth a
> separate patch. Unless there are any objections I will roll out what I
> have and post 3 separate patches.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 45e51ad2f7cf..ee34a51bd65a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
>  		if (points == chosen_points && thread_group_leader(chosen))
>  			continue;
>  
> +		/*
> +		 * If the currently evaluated task is already oom killed and
> +		 * this is a sysrq+f request then we would rather choose
> +		 * somebody else because the current oom victim might be stuck.
> +		 */
> +		if (is_sysrq_oom(oc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> +			continue;
> +
>  		chosen = p;
>  		chosen_points = points;
>  	}

I guess we can move this up to oom_scan_process_thread already. It would
be simpler and it should also be more appropriate because we already
do sysrq specific handling there:
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 45e51ad2f7cf..a27a43212075 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -277,10 +277,16 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
+	 * If we are doing sysrq+f then it doesn't make any sense to check such
+	 * a task because it might be stuck and unable to terminate while the
+	 * forced OOM might be the only option left to get the system back to
+	 * work.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
+		else
+			return OOM_SCAN_CONTINUE;
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2015-12-29 13:58 ` Tetsuo Handa
@ 2016-01-07 16:28   ` Johannes Weiner
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2016-01-07 16:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Tue, Dec 29, 2015 at 10:58:22PM +0900, Tetsuo Handa wrote:
> From 8bb9e36891a803e82c589ef78077838026ce0f7d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 29 Dec 2015 22:20:58 +0900
> Subject: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
> 
> The OOM reaper kernel thread can reclaim an OOM victim's memory before the
> victim terminates. But since oom_kill_process() tries to kill a child of the
> memory hog process first, the OOM reaper cannot reclaim enough memory for the
> victim to terminate if that victim is consuming little memory. The result is
> the usual OOM livelock, because timeout-based selection of the next OOM victim
> is not implemented.

What we should be doing is have the OOM reaper clear TIF_MEMDIE after
it's done. There is no reason to wait for and prioritize the exit of a
task that doesn't even have memory anymore. Once a task's memory has
been reaped, subsequent OOM invocations should evaluate anew the most
desirable OOM victim.
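
For illustration only: a rough sketch of that idea. The function names
(oom_reap_task(), oom_reap_vmas()) are assumptions based on the OOM reaper
proposal under discussion, not code from this thread, and locking and mm
lifetime details are omitted.

static void oom_reap_task(struct task_struct *tsk)
{
	/* Best-effort unmapping of the victim's address space (assumed helper). */
	if (!oom_reap_vmas(tsk->mm))
		return;

	/*
	 * The victim no longer pins a significant amount of memory, so
	 * stop treating it as the one and only OOM victim; a later OOM
	 * invocation can then evaluate candidates anew.
	 */
	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
}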


* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07 15:44         ` Michal Hocko
@ 2016-01-08 10:09           ` Tetsuo Handa
  0 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-08 10:09 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 07-01-16 15:58:41, Michal Hocko wrote:
> > On Thu 07-01-16 22:31:32, Tetsuo Handa wrote:
> > [...]
> > > I think we need to filter at select_bad_process() and oom_kill_process().
> > >
> > > When P has no children, P is chosen and TIF_MEMDIE is set on P. But P can
> > > be chosen forever due to P->signal->oom_score_adj == OOM_SCORE_ADJ_MAX
> > > even if the OOM reaper reclaimed P's mm. We need to ensure that
> > > oom_kill_process() is not called with P if P already has TIF_MEMDIE.
> >
> > Hmm. Any task is allowed to set its oom_score_adj that way and I
> > guess we should really make sure that at least sysrq+f will make some
> > progress. This is what I would do. Again I think this is worth a
> > separate patch. Unless there are any objections I will roll out what I
> > have and post 3 separate patches.
> > ---
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 45e51ad2f7cf..ee34a51bd65a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> >  		if (points == chosen_points && thread_group_leader(chosen))
> >  			continue;
> >
> > +		/*
> > +		 * If the current major task is already oom killed and this
> > +		 * is a sysrq+f request then we rather choose somebody else
> > +		 * because the current oom victim might be stuck.
> > +		 */
> > +		if (is_sysrq_oom(oc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> > +			continue;
> > +
> >  		chosen = p;
> >  		chosen_points = points;
> >  	}
>
> I guess we can move this up to oom_scan_process_thread already. It would
> be simpler and I it should be also more appropriate because we already
> do sysrq specific handling there:
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 45e51ad2f7cf..a27a43212075 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -277,10 +277,16 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
> +	 * If we are doing sysrq+f then it doesn't make any sense to check such
> +	 * a task because it might be stuck and unable to terminate while the
> +	 * forced OOM might be the only option left to get the system back to
> +	 * work.
>  	 */

"the forced OOM might be the only option left to get the system back to work"
makes an admission that "it doesn't make sense to wait for such a task forever
but we do not offer other options to get the system back to work". The forced
OOM is not always available on non-desktop systems; we can't ask administrators
to stand by in front of the console twenty-four seven.

We spent more than one year and found several bugs, but we still cannot find
bulletproof OOM handling. We are making the code more and more complicated and
difficult to test all cases. Unspotted corner cases annoy administrators and
troubleshooting staffs at support center (e.g. me). For some systems, papering
over OOM related problems is more important than trying to debug to the last
second. Firstly, we should offer another option to get the system back to work
( http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp )
for such systems. After that, we can try to avoid using such options.

>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
>  		if (!is_sysrq_oom(oc))
>  			return OOM_SCAN_ABORT;
> +		else
> +			return OOM_SCAN_CONTINUE;

This preserves the possibility of choosing a !TIF_MEMDIE thread which belongs
to a process in which at least one thread is a TIF_MEMDIE thread.
We cannot guarantee that find_lock_task_mm() from oom_kill_process() chooses
a !TIF_MEMDIE thread unless we check TIF_MEMDIE at find_lock_task_mm().

If we do not want to require SysRq-f for each thread in a process, the updated
patch shown below guarantees that (for both the SysRq-f option and the other
option in the patch above). Alternatively, we can change it as follows

static struct task_struct *find_lock_non_victim_task_mm(struct task_struct *p)
{
	struct task_struct *t;

	rcu_read_lock();

	for_each_thread(p, t) {
		if (unlikely(test_tsk_thread_flag(t, TIF_MEMDIE)))
			continue;
		task_lock(t);
		if (likely(t->mm))
			goto found;
		task_unlock(t);
	}
	t = NULL;
 found:
	rcu_read_unlock();

	return t;
}

if we want to require SysRq-f for each thread in a process.

>  	}
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
>
----------------------------------------
>From 1b199467eaf9e3a8cdac5eacde704fbd13969f68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jan 2016 12:40:02 +0900
Subject: [PATCH v2] mm,oom: exclude TIF_MEMDIE processes from candidates.

The OOM reaper kernel thread can reclaim OOM victim's memory before the
victim terminates.  But since oom_kill_process() tries to kill children of
the memory hog process first, the OOM reaper can not reclaim enough memory
for terminating the victim if the victim is consuming little memory.  The
result is OOM livelock as usual, for timeout based next OOM victim
selection is not implemented.

While SysRq-f (manual invocation of the OOM killer) can wake up the OOM
killer, the OOM killer chooses the same OOM victim which already has
TIF_MEMDIE.  This is effectively disabling SysRq-f.

This patch excludes TIF_MEMDIE processes from candidates so that the
memory hog process itself will be killed when all children of the memory
hog process got stuck with TIF_MEMDIE pending.

[  120.078776] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[  120.088610] oom-write cpuset=/ mems_allowed=0
[  120.095558] CPU: 0 PID: 9546 Comm: oom-write Not tainted 4.4.0-rc6-next-20151223 #260
(...snipped...)
[  120.194148] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  120.260191] [ 9546]  1000  9546   541716   453473     896       6        0             0 oom-write
[  120.262166] [ 9547]  1000  9547       40        1       3       2        0             0 write
[  120.264071] [ 9548]  1000  9548       40        1       3       2        0             0 write
[  120.265939] [ 9549]  1000  9549       40        1       4       2        0             0 write
[  120.267794] [ 9550]  1000  9550       40        1       3       2        0             0 write
[  120.269654] [ 9551]  1000  9551       40        1       3       2        0             0 write
[  120.271447] [ 9552]  1000  9552       40        1       3       2        0             0 write
[  120.273220] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  120.274975] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  120.276745] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  120.278516] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  120.280227] Out of memory: Kill process 9546 (oom-write) score 892 or sacrifice child
[  120.282010] Killed process 9549 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  122.506001] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[  122.515041] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[  122.697515] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  122.699492] [ 9551]  1000  9551       40        1       3       2        0             0 write
[  122.701399] [ 9552]  1000  9552       40        1       3       2        0             0 write
[  122.703282] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  122.705188] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  122.707017] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  122.708842] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  122.710675] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  122.712475] Killed process 9551 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  139.606508] sysrq: SysRq : Manual OOM execution
[  139.612371] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  139.620210] kworker/0:2 cpuset=/ mems_allowed=0
(...snipped...)
[  139.795759] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  139.797649] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  139.799526] [ 9552]  1000  9552       40        1       3       2        0             0 write
[  139.801368] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  139.803249] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  139.805020] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  139.806799] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  139.808524] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  139.810216] Killed process 9552 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  142.571815] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  142.573840] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  142.575754] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  142.577633] [ 9553]  1000  9553       40        1       3       2        0             0 write
[  142.579433] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  142.581250] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  142.583003] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  142.585055] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  142.586796] Killed process 9553 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  143.599058] sysrq: SysRq : Manual OOM execution
[  143.604300] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  143.783739] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  143.785691] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  143.787532] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  143.789377] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  143.791172] [ 9554]  1000  9554       40        1       3       2        0             0 write
[  143.792985] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  143.794730] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  143.796723] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  143.798338] Killed process 9554 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  144.374525] sysrq: SysRq : Manual OOM execution
[  144.379779] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  144.560718] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  144.562657] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  144.564560] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  144.566369] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  144.568246] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  144.570001] [ 9555]  1000  9555       40        1       3       2        0             0 write
[  144.571794] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  144.573502] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  144.575119] Killed process 9555 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  145.158485] sysrq: SysRq : Manual OOM execution
[  145.163600] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  145.346059] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  145.348012] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  145.349954] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  145.351817] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  145.353701] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  145.355568] [ 9555]  1000  9555       40        0       3       2        0             0 write
[  145.357319] [ 9556]  1000  9556       40        1       3       2        0             0 write
[  145.359114] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  145.360733] Killed process 9556 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[  169.158408] sysrq: SysRq : Manual OOM execution
[  169.163612] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  169.343115] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  169.345053] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  169.346884] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  169.348965] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  169.350893] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  169.352713] [ 9555]  1000  9555       40        0       3       2        0             0 write
[  169.354551] [ 9556]  1000  9556       40        0       3       2        0             0 write
[  169.356450] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  169.358105] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  178.950315] sysrq: SysRq : Manual OOM execution
[  178.955560] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
(...snipped...)
[  179.140752] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
[  179.142653] [ 9551]  1000  9551       40        0       3       2        0             0 write
[  179.144997] [ 9552]  1000  9552       40        0       3       2        0             0 write
[  179.146849] [ 9553]  1000  9553       40        0       3       2        0             0 write
[  179.148654] [ 9554]  1000  9554       40        0       3       2        0             0 write
[  179.150411] [ 9555]  1000  9555       40        0       3       2        0             0 write
[  179.152291] [ 9556]  1000  9556       40        0       3       2        0             0 write
[  179.154002] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
[  179.155666] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/oom_kill.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ef89fda..edce443 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -125,6 +125,30 @@ found:
 }

 /*
+ * Treat the whole process p as unkillable when one of threads has
+ * TIF_MEMDIE pending. Otherwise, we may end up setting TIF_MEMDIE
+ * on the same victim forever (e.g. making SysRq-f unusable).
+ */
+static struct task_struct *find_lock_non_victim_task_mm(struct task_struct *p)
+{
+	struct task_struct *t;
+
+	rcu_read_lock();
+
+	for_each_thread(p, t) {
+		if (likely(!test_tsk_thread_flag(t, TIF_MEMDIE)))
+			continue;
+		t = NULL;
+		goto found;
+	}
+	t = find_lock_task_mm(p);
+ found:
+	rcu_read_unlock();
+
+	return t;
+}
+
+/*
  * order == -1 means the oom kill is required by sysrq, otherwise only
  * for display purposes.
  */
@@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	if (oom_unkillable_task(p, memcg, nodemask))
 		return 0;

-	p = find_lock_task_mm(p);
+	p = find_lock_non_victim_task_mm(p);
 	if (!p)
 		return 0;

@@ -367,7 +391,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;

-		task = find_lock_task_mm(p);
+		task = find_lock_non_victim_task_mm(p);
 		if (!task) {
 			/*
 			 * This is a kthread or all of p's threads have already
@@ -708,7 +732,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	read_unlock(&tasklist_lock);

-	p = find_lock_task_mm(victim);
+	p = find_lock_non_victim_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
 		return;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07 16:28   ` Johannes Weiner
@ 2016-01-08 12:37     ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-08 12:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, akpm, mgorman, rientjes, torvalds, oleg, hughd,
	andrea, riel, linux-mm, linux-kernel

On Thu 07-01-16 11:28:15, Johannes Weiner wrote:
> On Tue, Dec 29, 2015 at 10:58:22PM +0900, Tetsuo Handa wrote:
> > >From 8bb9e36891a803e82c589ef78077838026ce0f7d Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Tue, 29 Dec 2015 22:20:58 +0900
> > Subject: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
> > 
> > The OOM reaper kernel thread can reclaim OOM victim's memory before the victim
> > terminates. But since oom_kill_process() tries to kill children of the memory
> > hog process first, the OOM reaper can not reclaim enough memory for terminating
> > the victim if the victim is consuming little memory. The result is OOM livelock
> > as usual, for timeout based next OOM victim selection is not implemented.
> 
> What we should be doing is have the OOM reaper clear TIF_MEMDIE after
> it's done. There is no reason to wait for and prioritize the exit of a
> task that doesn't even have memory anymore. Once a task's memory has
> been reaped, subsequent OOM invocations should evaluate anew the most
> desirable OOM victim.

This is an interesting idea. It definitely sounds better than timeout
based solutions. I will cook up a patch for this. The API between oom
killer and the reaper has to change slightly but that shouldn't be a big
deal.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-08 12:37     ` Michal Hocko
@ 2016-01-08 13:14       ` Tetsuo Handa
  -1 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-08 13:14 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 07-01-16 11:28:15, Johannes Weiner wrote:
> > On Tue, Dec 29, 2015 at 10:58:22PM +0900, Tetsuo Handa wrote:
> > > >From 8bb9e36891a803e82c589ef78077838026ce0f7d Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Tue, 29 Dec 2015 22:20:58 +0900
> > > Subject: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
> > > 
> > > The OOM reaper kernel thread can reclaim OOM victim's memory before the victim
> > > terminates. But since oom_kill_process() tries to kill children of the memory
> > > hog process first, the OOM reaper can not reclaim enough memory for terminating
> > > the victim if the victim is consuming little memory. The result is OOM livelock
> > > as usual, for timeout based next OOM victim selection is not implemented.
> > 
> > What we should be doing is have the OOM reaper clear TIF_MEMDIE after
> > it's done. There is no reason to wait for and prioritize the exit of a
> > task that doesn't even have memory anymore. Once a task's memory has
> > been reaped, subsequent OOM invocations should evaluate anew the most
> > desirable OOM victim.
> 
> This is an interesting idea. It definitely sounds better than timeout
> based solutions. I will cook up a patch for this. The API between oom
> killer and the reaper has to change slightly but that shouldn't be a big
> deal.

That is part of what I suggested at
http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp .
| What about marking current OOM victim unkillable by updating
| victim->signal->oom_score_adj to OOM_SCORE_ADJ_MIN and clearing victim's
| TIF_MEMDIE flag when the victim is still alive for a second after
| oom_reap_vmas() completed?

Can we update the victim's oom_score_adj as well? Otherwise, the OOM killer
might choose the same victim if the victim's oom_score_adj was set to 1000.
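
A rough sketch of that combination (illustrative only: oom_mark_victim_reaped()
is a made-up name and a real patch would need proper synchronization, but
OOM_SCORE_ADJ_MIN, signal->oom_score_adj and clear_tsk_thread_flag() are the
existing pieces):

static void oom_mark_victim_reaped(struct task_struct *victim)
{
	/* Make the badness heuristic skip this task from now on. */
	victim->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
	/* Drop the access-to-memory-reserves privilege. */
	clear_tsk_thread_flag(victim, TIF_MEMDIE);
}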

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-08 13:14       ` Tetsuo Handa
@ 2016-01-08 13:41         ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-08 13:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri 08-01-16 22:14:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 07-01-16 11:28:15, Johannes Weiner wrote:
> > > On Tue, Dec 29, 2015 at 10:58:22PM +0900, Tetsuo Handa wrote:
> > > > >From 8bb9e36891a803e82c589ef78077838026ce0f7d Mon Sep 17 00:00:00 2001
> > > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Date: Tue, 29 Dec 2015 22:20:58 +0900
> > > > Subject: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
> > > > 
> > > > The OOM reaper kernel thread can reclaim OOM victim's memory before the victim
> > > > terminates. But since oom_kill_process() tries to kill children of the memory
> > > > hog process first, the OOM reaper can not reclaim enough memory for terminating
> > > > the victim if the victim is consuming little memory. The result is OOM livelock
> > > > as usual, for timeout based next OOM victim selection is not implemented.
> > > 
> > > What we should be doing is have the OOM reaper clear TIF_MEMDIE after
> > > it's done. There is no reason to wait for and prioritize the exit of a
> > > task that doesn't even have memory anymore. Once a task's memory has
> > > been reaped, subsequent OOM invocations should evaluate anew the most
> > > desirable OOM victim.
> > 
> > This is an interesting idea. It definitely sounds better than timeout
> > based solutions. I will cook up a patch for this. The API between oom
> > killer and the reaper has to change slightly but that shouldn't be a big
> > deal.
> 
> That is part of what I suggested at
> http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp .
> | What about marking current OOM victim unkillable by updating
> | victim->signal->oom_score_adj to OOM_SCORE_ADJ_MIN and clearing victim's
> | TIF_MEMDIE flag when the victim is still alive for a second after
> | oom_reap_vmas() completed?

Sorry, I must have missed this part. I have added your Suggested-by to the
patch description.

> Can we update victim's oom_score_adj as well? Otherwise, the OOM killer
> might choose the same victim if victim's oom_score_adj was set to 1000.

Yes I've done that in the patch I am testing ATM.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-07 15:38         ` Tetsuo Handa
@ 2016-01-11 15:18           ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-11 15:18 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Fri 08-01-16 00:38:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> >  		if (points == chosen_points && thread_group_leader(chosen))
> >  			continue;
> >  
> > +		/*
> > +		 * If the current major task is already oom killed and this
> > +		 * is a sysrq+f request then we rather choose somebody else
> > +		 * because the current oom victim might be stuck.
> > +		 */
> > +		if (is_sysrq_oom(oc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> > +			continue;
> > +
> >  		chosen = p;
> >  		chosen_points = points;
> >  	}
> 
> Do we want to require SysRq-f for each thread in a process?
> If g has 1024 p, dump_tasks() will do
> 
>   pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",
> 
> for 1024 times? I think one SysRq-f per one process is sufficient.

I am not following you here. If we kill the process, the whole process
group (aka all threads) will get killed whichever thread we happen to
send the sigkill to.
 
> How can we guarantee that find_lock_task_mm() from oom_kill_process()
> chooses !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose
> !TIF_MEMDIE thread? I think choosing !TIF_MEMDIE thread at
> find_lock_task_mm() is the simplest way.

find_lock_task_mm choosing a TIF_MEMDIE thread shouldn't change anything
because the whole thread group will go down anyway. If you want to
guarantee that sysrq+f never chooses a task which has a TIF_MEMDIE
thread then we would have to check for fatal_signal_pending as well
AFAIU. Fiddling with find_lock_task_mm will not help you there though,
unless I am missing something.
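
For illustration, a whole-thread-group check along those lines could look like
the sketch below (oom_task_already_dying() is a hypothetical name;
for_each_thread(), test_tsk_thread_flag() and fatal_signal_pending() are the
existing primitives). select_bad_process() could then test
"is_sysrq_oom(oc) && oom_task_already_dying(p)" instead of a single thread's
TIF_MEMDIE flag:

static bool oom_task_already_dying(struct task_struct *p)
{
	struct task_struct *t;
	bool dying = false;

	rcu_read_lock();
	for_each_thread(p, t) {
		/* Either mark means the thread group is already on its way out. */
		if (test_tsk_thread_flag(t, TIF_MEMDIE) ||
		    fatal_signal_pending(t)) {
			dying = true;
			break;
		}
	}
	rcu_read_unlock();

	return dying;
}
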
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-11 15:18           ` Michal Hocko
@ 2016-01-12 11:32             ` Tetsuo Handa
  -1 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-12 11:32 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 08-01-16 00:38:43, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> > >  		if (points == chosen_points && thread_group_leader(chosen))
> > >  			continue;
> > >  
> > > +		/*
> > > +		 * If the current major task is already oom killed and this
> > > +		 * is a sysrq+f request then we rather choose somebody else
> > > +		 * because the current oom victim might be stuck.
> > > +		 */
> > > +		if (is_sysrq_oom(oc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> > > +			continue;
> > > +
> > >  		chosen = p;
> > >  		chosen_points = points;
> > >  	}
> > 
> > Do we want to require SysRq-f for each thread in a process?
> > If g has 1024 p, dump_tasks() will do
> > 
> >   pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",
> > 
> > for 1024 times? I think one SysRq-f per one process is sufficient.
> 
> I am not following you here. If we kill the process the whole process
> group (aka all threads) will get killed which ever thread we happen to
> send the sigkill to.

Please distinguish between "sending SIGKILL to a process" and "all threads in
that process terminating". do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true)
sends SIGKILL to a victim process, but it does not guarantee that all threads
in that process terminate, even if the OOM reaper reclaimed the victim's memory.
That is when SysRq-f (and timeout based next victim selection) is needed, but
currently SysRq-f keeps selecting the wrong process forever.
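
Roughly speaking (illustrative only: victim_has_exited() is a made-up helper;
find_lock_task_mm() and task_unlock() are the existing primitives), the
difference is whether any thread of the victim still owns an mm:

static bool victim_has_exited(struct task_struct *victim)
{
	struct task_struct *t;

	/*
	 * fatal_signal_pending(victim) became true the moment SIGKILL was
	 * queued, but a thread stuck in the page allocator keeps its mm (and
	 * its TIF_MEMDIE privilege) until it actually runs the exit path.
	 */
	t = find_lock_task_mm(victim);
	if (!t)
		return true;	/* every thread has released its mm */
	task_unlock(t);
	return false;
}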

I can observe that SysRq-f is effectively disabled
(the complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160112.txt.xz ):
----------
[   86.767482] a.out invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[   86.769905] a.out cpuset=/ mems_allowed=0
[   86.771393] CPU: 2 PID: 9573 Comm: a.out Not tainted 4.4.0-next-20160112+ #279
(...snipped...)
[   86.874710] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[   86.945286] [ 9573]  1000  9573   541717   402522     796       6        0             0 a.out
[   86.947457] [ 9574]  1000  9574     1078       21       7       3        0             0 a.out
[   86.949568] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[   86.951538] Killed process 9574 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   86.955296] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[   86.958035] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[   87.128808] [ 9573]  1000  9573   541717   402522     796       6        0             0 a.out
[   87.130926] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[   87.133055] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[   87.134989] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  116.979564] sysrq: SysRq : Manual OOM execution
[  116.984119] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  116.986367] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  117.157045] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  117.159191] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  117.161302] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  117.163250] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  119.043685] sysrq: SysRq : Manual OOM execution
[  119.046239] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  119.048453] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  119.215982] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  119.218122] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  119.220237] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  119.222129] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  120.179644] sysrq: SysRq : Manual OOM execution
[  120.206938] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  120.209152] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  120.376821] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  120.378924] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  120.381065] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  120.382929] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  121.235296] sysrq: SysRq : Manual OOM execution
[  121.252742] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[  121.254955] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[  141.024984] a.out           D ffff88007c417948     0  9573   8117 0x00000080
[  141.026830]  ffff88007c417948 ffff880076cac2c0 ffff880076c442c0 ffff88007c418000
[  141.028789]  ffff88007c417980 ffff88007fc90240 00000000fffd7aa1 00000000000006bc
[  141.030746]  ffff88007c417960 ffffffff816fc1a7 ffff88007fc90240 ffff88007c417a08
[  141.032703] Call Trace:
[  141.033653]  [<ffffffff816fc1a7>] schedule+0x37/0x90
[  141.035056]  [<ffffffff81700567>] schedule_timeout+0x117/0x1c0
[  141.036629]  [<ffffffff810e1310>] ? init_timer_key+0x40/0x40
[  141.038182]  [<ffffffff81700694>] schedule_timeout_uninterruptible+0x24/0x30
[  141.039963]  [<ffffffff8114944b>] __alloc_pages_nodemask+0x91b/0xd90
[  141.041631]  [<ffffffff811925e6>] alloc_pages_vma+0xb6/0x290
[  141.043173]  [<ffffffff811711d0>] handle_mm_fault+0x1180/0x1630
[  141.044770]  [<ffffffff811700a4>] ? handle_mm_fault+0x54/0x1630
[  141.046355]  [<ffffffff8105a651>] __do_page_fault+0x1a1/0x440
[  141.047915]  [<ffffffff8105a920>] do_page_fault+0x30/0x80
[  141.049408]  [<ffffffff81702307>] ? native_iret+0x7/0x7
[  141.050876]  [<ffffffff817033e8>] page_fault+0x28/0x30
[  141.052327]  [<ffffffff813a6f3d>] ? __clear_user+0x3d/0x70
[  141.053831]  [<ffffffff813ab9e8>] iov_iter_zero+0x68/0x250
[  141.055346]  [<ffffffff814866a8>] read_iter_zero+0x38/0xb0
[  141.056854]  [<ffffffff811c0994>] __vfs_read+0xc4/0xf0
[  141.058295]  [<ffffffff811c154a>] vfs_read+0x7a/0x120
[  141.059711]  [<ffffffff811c1df3>] SyS_read+0x53/0xd0
[  141.061104]  [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
[  141.062768] a.out           x ffff88007b92fca0     0  9574   9573 0x00000084
[  141.064604]  ffff88007b92fca0 ffff880076cac2c0 ffff88007a862c80 ffff88007b930000
[  141.066555]  ffff88007a863040 ffff88007a863308 ffff88007a862c80 ffff88007cc10000
[  141.068492]  ffff88007b92fcb8 ffffffff816fc1a7 ffff88007a863308 ffff88007b92fd28
[  141.070437] Call Trace:
[  141.071389]  [<ffffffff816fc1a7>] schedule+0x37/0x90
[  141.072788]  [<ffffffff810733fe>] do_exit+0x6be/0xb50
[  141.074198]  [<ffffffff81073917>] do_group_exit+0x47/0xc0
[  141.075676]  [<ffffffff8107f122>] get_signal+0x222/0x7e0
[  141.077135]  [<ffffffff8100f232>] do_signal+0x32/0x6d0
[  141.078570]  [<ffffffff81095cc8>] ? finish_task_switch+0xa8/0x2b0
[  141.080176]  [<ffffffff8106b967>] ? syscall_slow_exit_work+0x4b/0x10d
[  141.081837]  [<ffffffff81095cc8>] ? finish_task_switch+0xa8/0x2b0
[  141.083441]  [<ffffffff8106b8ba>] ? exit_to_usermode_loop+0x2e/0x90
[  141.085063]  [<ffffffff8106b8d8>] exit_to_usermode_loop+0x4c/0x90
[  141.086667]  [<ffffffff8100355b>] syscall_return_slowpath+0xbb/0x130
[  141.088305]  [<ffffffff817018da>] int_ret_from_sys_call+0x25/0x9f
[  141.089896] a.out           D ffff88007be2fab8     0  9575   9573 0x00100084
[  141.091734]  ffff88007be2fab8 ffff880036509640 ffff8800366742c0 ffff88007be30000
[  141.093688]  0000000000000000 7fffffffffffffff ffff88007ff72cb8 ffffffff816fca00
[  141.095743]  ffff88007be2fad0 ffffffff816fc1a7 ffff88007fc17280 ffff88007be2fb70
[  141.097699] Call Trace:
[  141.098649]  [<ffffffff816fca00>] ? bit_wait+0x60/0x60
[  141.100071]  [<ffffffff816fc1a7>] schedule+0x37/0x90
[  141.101453]  [<ffffffff817005c8>] schedule_timeout+0x178/0x1c0
[  141.103001]  [<ffffffff810e81e2>] ? ktime_get+0x102/0x130
[  141.104468]  [<ffffffff810bdfd9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  141.106158]  [<ffffffff810be0ad>] ? trace_hardirqs_on+0xd/0x10
[  141.107698]  [<ffffffff810e8187>] ? ktime_get+0xa7/0x130
[  141.109138]  [<ffffffff811276ea>] ? __delayacct_blkio_start+0x1a/0x30
[  141.110782]  [<ffffffff816fb641>] io_schedule_timeout+0xa1/0x110
[  141.112350]  [<ffffffff816fca16>] bit_wait_io+0x16/0x70
[  141.113774]  [<ffffffff816fc62b>] __wait_on_bit+0x5b/0x90
[  141.115234]  [<ffffffff8113f83a>] ? find_get_pages_tag+0x19a/0x2c0
[  141.116824]  [<ffffffff8113e5c6>] wait_on_page_bit+0xc6/0xf0
[  141.118319]  [<ffffffff810b5830>] ? autoremove_wake_function+0x30/0x30
[  141.119983]  [<ffffffff8113e797>] __filemap_fdatawait_range+0x107/0x190
[  141.121643]  [<ffffffff81140a8c>] ? __filemap_fdatawrite_range+0xcc/0x100
[  141.123352]  [<ffffffff8113e82f>] filemap_fdatawait_range+0xf/0x30
[  141.124955]  [<ffffffff81140bad>] filemap_write_and_wait_range+0x3d/0x60
[  141.126655]  [<ffffffff812b2614>] xfs_file_fsync+0x44/0x180
[  141.128149]  [<ffffffff811f482b>] vfs_fsync_range+0x3b/0xb0
[  141.129646]  [<ffffffff812b4242>] xfs_file_write_iter+0x102/0x140
[  141.131260]  [<ffffffff811c0a87>] __vfs_write+0xc7/0x100
[  141.132702]  [<ffffffff811c168d>] vfs_write+0x9d/0x190
[  141.134108]  [<ffffffff811e104a>] ? __fget_light+0x6a/0x90
[  141.135593]  [<ffffffff811c1ec3>] SyS_write+0x53/0xd0
[  141.136998]  [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
[  141.138646] a.out           D ffff88007af4fce8     0  9576   9573 0x00000084
[  141.140490]  ffff88007af4fce8 ffff8800366742c0 ffff880036672c80 ffff88007af50000
[  141.142415]  ffff88007d14a5b0 ffff880036672c80 0000000000000246 00000000ffffffff
[  141.144331]  ffff88007af4fd00 ffffffff816fc1a7 ffff88007d14a5a8 ffff88007af4fd10
[  141.146308] Call Trace:
[  141.147261]  [<ffffffff816fc1a7>] schedule+0x37/0x90
[  141.148651]  [<ffffffff816fc4d0>] schedule_preempt_disabled+0x10/0x20
[  141.150326]  [<ffffffff816fd31b>] mutex_lock_nested+0x17b/0x3e0
[  141.151902]  [<ffffffff812b3faf>] ? xfs_file_buffered_aio_write+0x5f/0x1f0
[  141.153647]  [<ffffffff812b3faf>] xfs_file_buffered_aio_write+0x5f/0x1f0
[  141.155397]  [<ffffffff812b41c4>] xfs_file_write_iter+0x84/0x140
[  141.156989]  [<ffffffff811c0a87>] __vfs_write+0xc7/0x100
[  141.158460]  [<ffffffff811c168d>] vfs_write+0x9d/0x190
[  141.159933]  [<ffffffff811e104a>] ? __fget_light+0x6a/0x90
[  141.161417]  [<ffffffff811c1ec3>] SyS_write+0x53/0xd0
[  141.162853]  [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
(...snipped...)
[  181.154922] [ 9573]  1000  9573   541717   402522     797       6        0             0 a.out
[  181.157145] [ 9575]  1000  9574     1078        0       7       3        0             0 a.out
[  181.159265] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[  181.161160] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  184.227075] sysrq: SysRq : Kill All Tasks
----------
using linux-next-20160112 without the "mm,oom: exclude TIF_MEMDIE processes from
candidates." patch and the reproducer shown below.
----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

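/* Writer threads: keep appending to /tmp/file with O_SYNC so that they are
 * likely to be stuck in writeback (D state) when the system runs out of
 * memory. */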
static int file_writer(void *unused)
{
	static char buffer[4096] = { }; 
	const int fd = open("/tmp/file",
			    O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

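/* Memory hog: grow an anonymous buffer as far as realloc() allows, then
 * touch all of it by reading from /dev/zero. */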
static int memory_consumer(void *unused)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(1);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

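/* The forked child creates up to 10 file_writer threads in its own thread
 * group (CLONE_THREAD), each on a small malloc()ed stack, while the parent
 * runs memory_consumer(). */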
int main(int argc, char *argv[])
{
	if (fork() == 0) {
		int i;
		for (i = 0; i < 10; i++) {
			char *cp = malloc(4096);
			if (!cp || clone(file_writer, cp + 4096,
					 CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
				break;
		}
	} else {
		memory_consumer(NULL);
	}
	while (1)
		pause();
}
----------

>  
> > How can we guarantee that find_lock_task_mm() from oom_kill_process()
> > chooses !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose
> > !TIF_MEMDIE thread? I think choosing !TIF_MEMDIE thread at
> > find_lock_task_mm() is the simplest way.
> 
> find_lock_task_mm chosing TIF_MEMDIE thread shouldn't change anything
> because the whole thread group will go down anyway. If you want to
> guarantee that the sysrq+f never choses a task which has a TIF_MEMDIE
> thread then we would have to check for fatal_signal_pending as well
> AFAIU. Fiddling with find find_lock_task_mm will not help you though
> unless I am missing something.

I do want to guarantee that the SysRq-f (and timeout based next victim
selection) never chooses a process which has a TIF_MEMDIE thread.

I don't like current "oom: clear TIF_MEMDIE after oom_reaper managed to unmap
the address space" patch unless both "mm,oom: exclude TIF_MEMDIE processes from
candidates." patch and "mm,oom: Re-enable OOM killer using timers." patch are
used together. Since your patch covers only the likely case, it cannot be an
alternative to my patches, which cover the unlikely cases.
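
To make the "timeout based" part concrete, the idea can be sketched roughly
as follows; memdie_stamp and oom_victim_expired() are invented names for
illustration and are not taken from any posted patch:
----------
#define OOM_VICTIM_TIMEOUT	(10 * HZ)	/* arbitrary example value */

/* memdie_stamp would be recorded (in jiffies) when TIF_MEMDIE is granted. */
static bool oom_victim_expired(struct task_struct *p)
{
	return time_after(jiffies, p->memdie_stamp + OOM_VICTIM_TIMEOUT);
}

	/* ... in the victim selection loop: keep skipping a TIF_MEMDIE
	 * victim only while it is still within its grace period ... */
	if (test_tsk_thread_flag(p, TIF_MEMDIE) && !oom_victim_expired(p))
		continue;
----------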

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-12 11:32             ` Tetsuo Handa
@ 2016-01-12 19:52               ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-12 19:52 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Tue 12-01-16 20:32:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 08-01-16 00:38:43, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> > > >  		if (points == chosen_points && thread_group_leader(chosen))
> > > >  			continue;
> > > >  
> > > > +		/*
> > > > +		 * If the current major task is already ooom killed and this
> > > > +		 * is sysrq+f request then we rather choose somebody else
> > > > +		 * because the current oom victim might be stuck.
> > > > +		 */
> > > > +		if (is_sysrq_oom(sc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> > > > +			continue;
> > > > +
> > > >  		chosen = p;
> > > >  		chosen_points = points;
> > > >  	}
> > > 
> > > Do we want to require SysRq-f for each thread in a process?
> > > If g has 1024 p, dump_tasks() will do
> > > 
> > >   pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",
> > > 
> > > for 1024 times? I think one SysRq-f per one process is sufficient.
> > 
> > I am not following you here. If we kill the process the whole process
> > group (aka all threads) will get killed which ever thread we happen to
> > send the sigkill to.
> 
> Please distinguish "sending SIGKILL to a process" and "all threads in that
> process terminate".

I didn't say anything about termination if you read my response again.

[...]

> > > How can we guarantee that find_lock_task_mm() from oom_kill_process()
> > > chooses !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose
> > > !TIF_MEMDIE thread? I think choosing !TIF_MEMDIE thread at
> > > find_lock_task_mm() is the simplest way.
> > 
> > find_lock_task_mm chosing TIF_MEMDIE thread shouldn't change anything
> > because the whole thread group will go down anyway. If you want to
> > guarantee that the sysrq+f never choses a task which has a TIF_MEMDIE
> > thread then we would have to check for fatal_signal_pending as well
> > AFAIU. Fiddling with find find_lock_task_mm will not help you though
> > unless I am missing something.
> 
> I do want to guarantee that the SysRq-f (and timeout based next victim
> selection) never chooses a process which has a TIF_MEMDIE thread.

Sigh... see what I have written in the paragraph you are replying to...

> I don't like current "oom: clear TIF_MEMDIE after oom_reaper managed to unmap
> the address space" patch unless both "mm,oom: exclude TIF_MEMDIE processes from
> candidates." patch and "mm,oom: Re-enable OOM killer using timers."

Those patches are definitely not a prerequisite from the functional
point of view and putting them together as a prerequisite sounds like
blocking a useful feature without technical grounds to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-08 10:09           ` Tetsuo Handa
@ 2016-01-13  0:32             ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-13  0:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri, 8 Jan 2016, Tetsuo Handa wrote:

> The OOM reaper kernel thread can reclaim OOM victim's memory before the
> victim terminates.  But since oom_kill_process() tries to kill children of
> the memory hog process first, the OOM reaper can not reclaim enough memory
> for terminating the victim if the victim is consuming little memory.  The
> result is OOM livelock as usual, for timeout based next OOM victim
> selection is not implemented.
> 
> While SysRq-f (manual invocation of the OOM killer) can wake up the OOM
> killer, the OOM killer chooses the same OOM victim which already has
> TIF_MEMDIE.  This is effectively disabling SysRq-f.
> 
> This patch excludes TIF_MEMDIE processes from candidates so that the
> memory hog process itself will be killed when all children of the memory
> hog process got stuck with TIF_MEMDIE pending.
> 
> [  120.078776] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
> [  120.088610] oom-write cpuset=/ mems_allowed=0
> [  120.095558] CPU: 0 PID: 9546 Comm: oom-write Not tainted 4.4.0-rc6-next-20151223 #260
> (...snipped...)
> [  120.194148] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> (...snipped...)
> [  120.260191] [ 9546]  1000  9546   541716   453473     896       6        0             0 oom-write
> [  120.262166] [ 9547]  1000  9547       40        1       3       2        0             0 write
> [  120.264071] [ 9548]  1000  9548       40        1       3       2        0             0 write
> [  120.265939] [ 9549]  1000  9549       40        1       4       2        0             0 write
> [  120.267794] [ 9550]  1000  9550       40        1       3       2        0             0 write
> [  120.269654] [ 9551]  1000  9551       40        1       3       2        0             0 write
> [  120.271447] [ 9552]  1000  9552       40        1       3       2        0             0 write
> [  120.273220] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  120.274975] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  120.276745] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  120.278516] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  120.280227] Out of memory: Kill process 9546 (oom-write) score 892 or sacrifice child
> [  120.282010] Killed process 9549 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> (...snipped...)
> [  122.506001] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
> [  122.515041] systemd-journal cpuset=/ mems_allowed=0
> (...snipped...)
> [  122.697515] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  122.699492] [ 9551]  1000  9551       40        1       3       2        0             0 write
> [  122.701399] [ 9552]  1000  9552       40        1       3       2        0             0 write
> [  122.703282] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  122.705188] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  122.707017] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  122.708842] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  122.710675] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  122.712475] Killed process 9551 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  139.606508] sysrq: SysRq : Manual OOM execution
> [  139.612371] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> [  139.620210] kworker/0:2 cpuset=/ mems_allowed=0
> (...snipped...)
> [  139.795759] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  139.797649] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  139.799526] [ 9552]  1000  9552       40        1       3       2        0             0 write
> [  139.801368] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  139.803249] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  139.805020] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  139.806799] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  139.808524] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  139.810216] Killed process 9552 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> (...snipped...)
> [  142.571815] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  142.573840] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  142.575754] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  142.577633] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  142.579433] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  142.581250] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  142.583003] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  142.585055] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  142.586796] Killed process 9553 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  143.599058] sysrq: SysRq : Manual OOM execution
> [  143.604300] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  143.783739] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  143.785691] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  143.787532] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  143.789377] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  143.791172] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  143.792985] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  143.794730] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  143.796723] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  143.798338] Killed process 9554 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  144.374525] sysrq: SysRq : Manual OOM execution
> [  144.379779] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  144.560718] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  144.562657] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  144.564560] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  144.566369] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  144.568246] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  144.570001] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  144.571794] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  144.573502] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  144.575119] Killed process 9555 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  145.158485] sysrq: SysRq : Manual OOM execution
> [  145.163600] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  145.346059] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  145.348012] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  145.349954] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  145.351817] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  145.353701] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  145.355568] [ 9555]  1000  9555       40        0       3       2        0             0 write
> [  145.357319] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  145.359114] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  145.360733] Killed process 9556 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  169.158408] sysrq: SysRq : Manual OOM execution
> [  169.163612] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  169.343115] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  169.345053] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  169.346884] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  169.348965] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  169.350893] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  169.352713] [ 9555]  1000  9555       40        0       3       2        0             0 write
> [  169.354551] [ 9556]  1000  9556       40        0       3       2        0             0 write
> [  169.356450] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  169.358105] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  178.950315] sysrq: SysRq : Manual OOM execution
> [  178.955560] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  179.140752] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  179.142653] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  179.144997] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  179.146849] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  179.148654] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  179.150411] [ 9555]  1000  9555       40        0       3       2        0             0 write
> [  179.152291] [ 9556]  1000  9556       40        0       3       2        0             0 write
> [  179.154002] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  179.155666] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/oom_kill.c | 30 +++++++++++++++++++++++++++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ef89fda..edce443 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -125,6 +125,30 @@ found:
>  }
> 
>  /*
> + * Treat the whole process p as unkillable when one of threads has
> + * TIF_MEMDIE pending. Otherwise, we may end up setting TIF_MEMDIE
> + * on the same victim forever (e.g. making SysRq-f unusable).
> + */
> +static struct task_struct *find_lock_non_victim_task_mm(struct task_struct *p)
> +{
> +	struct task_struct *t;
> +
> +	rcu_read_lock();
> +
> +	for_each_thread(p, t) {
> +		if (likely(!test_tsk_thread_flag(t, TIF_MEMDIE)))
> +			continue;
> +		t = NULL;
> +		goto found;
> +	}
> +	t = find_lock_task_mm(p);
> + found:
> +	rcu_read_unlock();
> +
> +	return t;
> +}
> +
> +/*
>   * order == -1 means the oom kill is required by sysrq, otherwise only
>   * for display purposes.
>   */
> @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	if (oom_unkillable_task(p, memcg, nodemask))
>  		return 0;
> 
> -	p = find_lock_task_mm(p);
> +	p = find_lock_non_victim_task_mm(p);
>  	if (!p)
>  		return 0;
> 

I understand how this may make your test case pass, but I simply don't 
understand how this could possibly be the correct thing to do.  This would 
cause oom_badness() to return 0 for any process where a thread has 
TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
kills a thread, and is invoked again before that thread has exited, then this 
will panic the system if there are no other eligible processes to kill.
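
Roughly, the flow I am worried about is the following (approximate names,
not verbatim kernel code):
----------
	/* Condensed sketch of the concern above. */
	p = select_bad_process(oc, &points, totalpages);

	/* With the proposed change, oom_badness() returns 0 for every
	 * process that has a TIF_MEMDIE thread, so p can be NULL here even
	 * though killable memory hogs still exist. */
	if (!p && !is_sysrq_oom(oc))
		panic("Out of memory and no killable processes...\n");
----------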

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
@ 2016-01-13  0:32             ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-13  0:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri, 8 Jan 2016, Tetsuo Handa wrote:

> The OOM reaper kernel thread can reclaim OOM victim's memory before the
> victim terminates.  But since oom_kill_process() tries to kill children of
> the memory hog process first, the OOM reaper can not reclaim enough memory
> for terminating the victim if the victim is consuming little memory.  The
> result is OOM livelock as usual, for timeout based next OOM victim
> selection is not implemented.
> 
> While SysRq-f (manual invocation of the OOM killer) can wake up the OOM
> killer, the OOM killer chooses the same OOM victim which already has
> TIF_MEMDIE.  This is effectively disabling SysRq-f.
> 
> This patch excludes TIF_MEMDIE processes from candidates so that the
> memory hog process itself will be killed when all children of the memory
> hog process got stuck with TIF_MEMDIE pending.
> 
> [  120.078776] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
> [  120.088610] oom-write cpuset=/ mems_allowed=0
> [  120.095558] CPU: 0 PID: 9546 Comm: oom-write Not tainted 4.4.0-rc6-next-20151223 #260
> (...snipped...)
> [  120.194148] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> (...snipped...)
> [  120.260191] [ 9546]  1000  9546   541716   453473     896       6        0             0 oom-write
> [  120.262166] [ 9547]  1000  9547       40        1       3       2        0             0 write
> [  120.264071] [ 9548]  1000  9548       40        1       3       2        0             0 write
> [  120.265939] [ 9549]  1000  9549       40        1       4       2        0             0 write
> [  120.267794] [ 9550]  1000  9550       40        1       3       2        0             0 write
> [  120.269654] [ 9551]  1000  9551       40        1       3       2        0             0 write
> [  120.271447] [ 9552]  1000  9552       40        1       3       2        0             0 write
> [  120.273220] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  120.274975] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  120.276745] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  120.278516] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  120.280227] Out of memory: Kill process 9546 (oom-write) score 892 or sacrifice child
> [  120.282010] Killed process 9549 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> (...snipped...)
> [  122.506001] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
> [  122.515041] systemd-journal cpuset=/ mems_allowed=0
> (...snipped...)
> [  122.697515] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  122.699492] [ 9551]  1000  9551       40        1       3       2        0             0 write
> [  122.701399] [ 9552]  1000  9552       40        1       3       2        0             0 write
> [  122.703282] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  122.705188] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  122.707017] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  122.708842] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  122.710675] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  122.712475] Killed process 9551 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  139.606508] sysrq: SysRq : Manual OOM execution
> [  139.612371] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> [  139.620210] kworker/0:2 cpuset=/ mems_allowed=0
> (...snipped...)
> [  139.795759] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  139.797649] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  139.799526] [ 9552]  1000  9552       40        1       3       2        0             0 write
> [  139.801368] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  139.803249] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  139.805020] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  139.806799] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  139.808524] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  139.810216] Killed process 9552 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> (...snipped...)
> [  142.571815] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  142.573840] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  142.575754] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  142.577633] [ 9553]  1000  9553       40        1       3       2        0             0 write
> [  142.579433] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  142.581250] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  142.583003] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  142.585055] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  142.586796] Killed process 9553 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  143.599058] sysrq: SysRq : Manual OOM execution
> [  143.604300] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  143.783739] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  143.785691] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  143.787532] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  143.789377] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  143.791172] [ 9554]  1000  9554       40        1       3       2        0             0 write
> [  143.792985] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  143.794730] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  143.796723] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  143.798338] Killed process 9554 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  144.374525] sysrq: SysRq : Manual OOM execution
> [  144.379779] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  144.560718] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  144.562657] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  144.564560] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  144.566369] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  144.568246] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  144.570001] [ 9555]  1000  9555       40        1       3       2        0             0 write
> [  144.571794] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  144.573502] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  144.575119] Killed process 9555 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  145.158485] sysrq: SysRq : Manual OOM execution
> [  145.163600] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  145.346059] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  145.348012] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  145.349954] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  145.351817] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  145.353701] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  145.355568] [ 9555]  1000  9555       40        0       3       2        0             0 write
> [  145.357319] [ 9556]  1000  9556       40        1       3       2        0             0 write
> [  145.359114] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  145.360733] Killed process 9556 (write) total-vm:160kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [  169.158408] sysrq: SysRq : Manual OOM execution
> [  169.163612] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  169.343115] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  169.345053] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  169.346884] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  169.348965] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  169.350893] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  169.352713] [ 9555]  1000  9555       40        0       3       2        0             0 write
> [  169.354551] [ 9556]  1000  9556       40        0       3       2        0             0 write
> [  169.356450] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  169.358105] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  178.950315] sysrq: SysRq : Manual OOM execution
> [  178.955560] kworker/0:2 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
> (...snipped...)
> [  179.140752] [ 9546]  1000  9546   541716   458687     906       6        0             0 oom-write
> [  179.142653] [ 9551]  1000  9551       40        0       3       2        0             0 write
> [  179.144997] [ 9552]  1000  9552       40        0       3       2        0             0 write
> [  179.146849] [ 9553]  1000  9553       40        0       3       2        0             0 write
> [  179.148654] [ 9554]  1000  9554       40        0       3       2        0             0 write
> [  179.150411] [ 9555]  1000  9555       40        0       3       2        0             0 write
> [  179.152291] [ 9556]  1000  9556       40        0       3       2        0             0 write
> [  179.154002] Out of memory: Kill process 9546 (oom-write) score 902 or sacrifice child
> [  179.155666] Killed process 9551 (write) total-vm:160kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/oom_kill.c | 30 +++++++++++++++++++++++++++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ef89fda..edce443 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -125,6 +125,30 @@ found:
>  }
> 
>  /*
> + * Treat the whole process p as unkillable when one of threads has
> + * TIF_MEMDIE pending. Otherwise, we may end up setting TIF_MEMDIE
> + * on the same victim forever (e.g. making SysRq-f unusable).
> + */
> +static struct task_struct *find_lock_non_victim_task_mm(struct task_struct *p)
> +{
> +	struct task_struct *t;
> +
> +	rcu_read_lock();
> +
> +	for_each_thread(p, t) {
> +		if (likely(!test_tsk_thread_flag(t, TIF_MEMDIE)))
> +			continue;
> +		t = NULL;
> +		goto found;
> +	}
> +	t = find_lock_task_mm(p);
> + found:
> +	rcu_read_unlock();
> +
> +	return t;
> +}
> +
> +/*
>   * order == -1 means the oom kill is required by sysrq, otherwise only
>   * for display purposes.
>   */
> @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	if (oom_unkillable_task(p, memcg, nodemask))
>  		return 0;
> 
> -	p = find_lock_task_mm(p);
> +	p = find_lock_non_victim_task_mm(p);
>  	if (!p)
>  		return 0;
> 

I understand how this may make your test case pass, but I simply don't 
understand how this could possibly be the correct thing to do.  This would 
cause oom_badness() to return 0 for any process where a thread has 
TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
kills a thread, and it is recalled before that thread may exit, then this 
will panic the system if there are no other eligible processes to kill.
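
For context, the failure path being described looks roughly like this in the
4.4-era out_of_memory() (a simplified sketch from memory, not a verbatim copy
of the current code):

	p = select_bad_process(oc, &points, totalpages);
	/* Found nothing?!?! Either we hang forever, or we panic. */
	if (!p && !is_sysrq_oom(oc)) {
		dump_header(oc, NULL, NULL);
		panic("Out of memory and no killable processes...\n");
	}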


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-12 19:52               ` Michal Hocko
@ 2016-01-13 10:15                 ` Tetsuo Handa
  -1 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-13 10:15 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > > > Do we want to require SysRq-f for each thread in a process?
> > > > If g has 1024 p, dump_tasks() will do
> > > >
> > > >   pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",
> > > >
> > > > for 1024 times? I think one SysRq-f per one process is sufficient.
> > >
> > > I am not following you here. If we kill the process the whole process
> > > group (aka all threads) will get killed which ever thread we happen to
> > > send the sigkill to.
> >
> > Please distinguish "sending SIGKILL to a process" and "all threads in that
> > process terminate".
>
> I didn't say anything about termination if your read my response again.

I think "the whole thread group will go down anyway" assumes termination.
The whole thread group (process group ?) will not go down if some of threads
got stuck and select_bad_process() and find_lock_task_mm() choose the same
thread forever. Even if we are lucky enough to terminate one thread per one
SysRq-f request, dump_tasks() will report the same thread group for many times
if we skip individual TIF_MEMDIE thread. In that regard, "[RFC 1/3] oom,sysrq:
Skip over oom victims and killed tasks" is nice that fatal_signal_pending(p)
prevents dump_tasks() from reporting the same thread group for many times if
that check is done in dump_tasks() as well.

By the way, why does "[RFC 1/3] oom,sysrq: Skip over oom victims and killed
tasks" not check task_will_free_mem(p) while "[RFC 2/3] oom: Do not sacrifice
already OOM killed children" checks task_will_free_mem(children)?
I think we can add a helper like

static bool task_should_terminate(struct task_struct *p)
{
	return fatal_signal_pending(p) || (p->flags & PF_EXITING) ||
	       test_tsk_thread_flag(p, TIF_MEMDIE);
}

and call it from both [RFC 1/3] and [RFC 2/3].
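
A minimal sketch of how such a helper could be used when scanning candidates
(the wrapper below is hypothetical and only illustrates the idea; the actual
call sites in [RFC 1/3] and [RFC 2/3] may look different):

static bool oom_task_already_dying(struct task_struct *p)
{
	struct task_struct *t;
	bool dying = false;

	rcu_read_lock();
	for_each_thread(p, t) {
		if (task_should_terminate(t)) {
			/* p is already on its way out; do not select it again */
			dying = true;
			break;
		}
	}
	rcu_read_unlock();
	return dying;
}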

>
> [...]
>
> > > > How can we guarantee that find_lock_task_mm() from oom_kill_process()
> > > > chooses !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose
> > > > !TIF_MEMDIE thread? I think choosing !TIF_MEMDIE thread at
> > > > find_lock_task_mm() is the simplest way.
> > >
> > > find_lock_task_mm chosing TIF_MEMDIE thread shouldn't change anything
> > > because the whole thread group will go down anyway. If you want to
> > > guarantee that the sysrq+f never choses a task which has a TIF_MEMDIE
> > > thread then we would have to check for fatal_signal_pending as well
> > > AFAIU. Fiddling with find find_lock_task_mm will not help you though
> > > unless I am missing something.
> >
> > I do want to guarantee that the SysRq-f (and timeout based next victim
> > selection) never chooses a process which has a TIF_MEMDIE thread.
>
> Sigh... see what I have written in the paragraph you are replying to...
>

???



> > I don't like current "oom: clear TIF_MEMDIE after oom_reaper managed to unmap
> > the address space" patch unless both "mm,oom: exclude TIF_MEMDIE processes from
> > candidates." patch and "mm,oom: Re-enable OOM killer using timers."
>
> Those patches are definitely not a prerequisite from the functional
> point of view and putting them together as a prerequisite sounds like
> blocking a useful feature without technical grounds to me.

I like the OOM reaper approach. I said I don't like the current patch because
it ignores the unlikely cases described below. If all of the changes that
cover these unlikely cases are implemented, my patches will become unnecessary.

(1) Make the OOM reaper available on CONFIG_MMU=n kernels.

    I am not familiar with the MMU internals, but I assume these link errors
    can be handled.

    slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
    slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
    slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'

(2) Do not boot the system if the OOM reaper thread could not be created.

    We already depend heavily on the OOM reaper.

    pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
                    PTR_ERR(oom_reaper_th));

(3) Eliminate locations that call mark_oom_victim() without
    putting the OOM victim task under the OOM reaper's supervision.

    The OOM reaper needs to take action when the OOM victim task gets stuck,
    because we (everyone except me) do not want to use my sysctl-controlled,
    timeout-based OOM victim selection.

    out_of_memory():
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

    oom_kill_process():
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

    mem_cgroup_out_of_memory():
        if (fatal_signal_pending(current) || task_will_free_mem(current)) {
                mark_oom_victim(current);
                goto unlock;
        }

    lowmem_scan():
        if (selected->mm)
                mark_oom_victim(selected);

(4) Don't select an OOM victim until mm_to_reap (or task_to_reap) becomes NULL.

    This is needed to make sure that every OOM victim is placed under the
    OOM reaper's supervision, so that the OOM reaper can take action before
    leaving oom_reap_vmas() (or oom_reap_task()).

    Since the OOM reaper can reset mm_to_reap (or task_to_reap) to NULL quickly
    (e.g. within a second if it retries 10 times at 0.1 second intervals),
    waiting should not become a problem.

(5) Decrease the oom_score_adj value after the OOM reaper has reclaimed memory.

    If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeded, set the
    oom_score_adj value of all tasks sharing the same mm to -1000 (by walking
    the process list) and clear TIF_MEMDIE (see the sketch after this list).

    Changing only the OOM victim's oom_score_adj is not sufficient
    when there are other thread groups sharing the OOM victim's memory
    (i.e. the clone(!CLONE_THREAD && CLONE_VM) case).

(6) Decrease the oom_score_adj value even if the OOM reaper failed to reclaim
    memory.

    If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) failed 10 times, decrease
    the oom_score_adj value of all tasks sharing the same mm and clear
    TIF_MEMDIE. This is needed to prevent the OOM killer from selecting the
    same thread group forever.

    For example, set oom_score_adj to -999 if it is greater than -999, and to
    -1000 if it is already -999. This allows the OOM killer to try different
    OOM victims before retrying __oom_reap_vmas(mm) (or __oom_reap_task(tsk))
    on this OOM victim, and to trigger a kernel panic once all OOM victims
    have reached -1000.

    Making the mmap_sem locking killable increases the chance that
    __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeds. But due to the
    changes in (3) and (4), there is no guarantee that TIF_MEMDIE is set on
    the thread which is looping in __alloc_pages_slowpath() with mmap_sem
    held for writing. If the OOM killer were able to know which thread is
    looping in __alloc_pages_slowpath() with mmap_sem held for writing (via a
    per-task_struct variable), it could set TIF_MEMDIE on that thread instead
    of choosing one thread more or less at random via find_lock_task_mm().

(7) Decrease the oom_score_adj value even if the OOM reaper is not allowed to
    reclaim memory.

    This is the same as (6), except that it covers cases where the OOM
    victim's memory is used by some OOM-unkillable threads (i.e. the
    can_oom_reap = false case).

    Calling wake_oom_reaper() with a can_oom_reap argument added is the
    simplest way to wait for a short period (e.g. a second) and then adjust
    the oom_score_adj value and clear TIF_MEMDIE.
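
A minimal sketch of the oom_score_adj / TIF_MEMDIE adjustment described in (5)
and (6), assuming it runs from the OOM reaper after __oom_reap_vmas(mm) has
either succeeded or given up; the helper name and the unlocked ->mm check are
only illustrative, not existing kernel code:

static void oom_unmark_victims(struct mm_struct *mm, bool reap_succeeded)
{
	struct task_struct *g, *t;

	rcu_read_lock();
	for_each_process(g) {
		/* racy ->mm check; good enough for a sketch */
		if (READ_ONCE(g->mm) != mm)
			continue;
		/* also covers clone(CLONE_VM && !CLONE_THREAD) users of mm */
		if (reap_succeeded)
			/* (5): memory was reclaimed, never select this mm again */
			g->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
		else
			/* (6): reaping failed, step down -999 -> -1000 */
			g->signal->oom_score_adj =
				g->signal->oom_score_adj > -999 ?
				-999 : OOM_SCORE_ADJ_MIN;
		for_each_thread(g, t)
			clear_tsk_thread_flag(t, TIF_MEMDIE);
	}
	rcu_read_unlock();
}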

Maybe I'm missing something else. But assuming that [RFC 1/3] and [RFC 2/3] do
what the "mm,oom: exclude TIF_MEMDIE processes from candidates." patch would
cover, adding the changes listed above to the current "oom: clear TIF_MEMDIE
after oom_reaper managed to unmap the address space" patch will do what the
"mm,oom: Re-enable OOM killer using timers." patch would cover.

Also, a kmallocwd-like approach (i.e. walking the process list) would
eliminate the need for (3) and (4). By using memalloc_info, that kernel thread
could make a better decision when choosing an OOM victim, based on questions
such as "Is this task currently waiting for a memory allocation?" or "Is this
a __GFP_NOFAIL order-5 allocation?".

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-13  0:32             ` David Rientjes
@ 2016-01-13 10:52               ` Tetsuo Handa
  -1 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-13 10:52 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

David Rientjes wrote:
> > @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >  	if (oom_unkillable_task(p, memcg, nodemask))
> >  		return 0;
> > 
> > -	p = find_lock_task_mm(p);
> > +	p = find_lock_non_victim_task_mm(p);
> >  	if (!p)
> >  		return 0;
> > 
> 
> I understand how this may make your test case pass, but I simply don't 
> understand how this could possibly be the correct thing to do.  This would 
> cause oom_badness() to return 0 for any process where a thread has 
> TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
> kills a thread, and it is recalled before that thread may exit, then this 
> will panic the system if there are no other eligible processes to kill.
> 
Why? oom_badness() is called after oom_scan_process_thread() returned OOM_SCAN_OK.
oom_scan_process_thread() returns OOM_SCAN_ABORT if a thread has TIF_MEMDIE set.

If the TIF_MEMDIE thread already exited, find_lock_non_victim_task_mm() acts like
find_lock_task_mm(). Otherwise, oom_scan_process_thread() acts like a blocker.
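
For reference, a simplified sketch of the select_bad_process() ordering I am
relying on, reconstructed from the 4.4-era mm/oom_kill.c (not a verbatim copy;
declarations and the other OOM_SCAN_* cases are omitted):

	rcu_read_lock();
	for_each_process_thread(g, p) {
		unsigned int points;

		switch (oom_scan_process_thread(oc, p, totalpages)) {
		case OOM_SCAN_ABORT:
			/* an existing TIF_MEMDIE victim: select nobody */
			rcu_read_unlock();
			return (struct task_struct *)(-1UL);
		case OOM_SCAN_CONTINUE:
			continue;
		case OOM_SCAN_OK:
			break;
		};
		/* reached only while no TIF_MEMDIE victim has been seen */
		points = oom_badness(p, NULL, oc->nodemask, totalpages);
		if (points > chosen_points) {
			chosen = p;
			chosen_points = points;
		}
	}
	rcu_read_unlock();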

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-13 10:15                 ` Tetsuo Handa
@ 2016-01-13 15:24                   ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2016-01-13 15:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 13-01-16 19:15:58, Tetsuo Handa wrote:
[...]
> I like the OOM reaper approach. I said I don't like the current patch
> because it ignores the unlikely cases described below.

Please articulate your concerns regarding the OOM reaper in the email thread
which proposes it, e.g. here:
http://lkml.kernel.org/r/1452094975-551-2-git-send-email-mhocko%40kernel.org

Discussing it in an unrelated thread will not help anyone searching for this
topic in the future.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-13 10:52               ` Tetsuo Handa
@ 2016-01-14  0:57                 ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-14  0:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Wed, 13 Jan 2016, Tetsuo Handa wrote:

> David Rientjes wrote:
> > > @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > >  	if (oom_unkillable_task(p, memcg, nodemask))
> > >  		return 0;
> > > 
> > > -	p = find_lock_task_mm(p);
> > > +	p = find_lock_non_victim_task_mm(p);
> > >  	if (!p)
> > >  		return 0;
> > > 
> > 
> > I understand how this may make your test case pass, but I simply don't 
> > understand how this could possibly be the correct thing to do.  This would 
> > cause oom_badness() to return 0 for any process where a thread has 
> > TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
> > kills a thread, and it is recalled before that thread may exit, then this 
> > will panic the system if there are no other eligible processes to kill.
> > 
> Why? oom_badness() is called after oom_scan_process_thread() returned OOM_SCAN_OK.
> oom_scan_process_thread() returns OOM_SCAN_ABORT if a thread has TIF_MEMDIE set.
> 

oom_scan_process_thread() checks for TIF_MEMDIE on p, not on p's threads.  
If one of p's threads has TIF_MEMDIE set and p does not, we actually want 
to set TIF_MEMDIE for p.  That's the current behavior since it will lead 
to p->mm memory freeing.  Your patch is excluding such processes entirely 
and selecting another process to kill unnecessarily.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-14  0:57                 ` David Rientjes
@ 2016-01-14 10:26                   ` Tetsuo Handa
  -1 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-14 10:26 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

David Rientjes wrote:
> On Wed, 13 Jan 2016, Tetsuo Handa wrote:
> 
> > David Rientjes wrote:
> > > > @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > > >  	if (oom_unkillable_task(p, memcg, nodemask))
> > > >  		return 0;
> > > > 
> > > > -	p = find_lock_task_mm(p);
> > > > +	p = find_lock_non_victim_task_mm(p);
> > > >  	if (!p)
> > > >  		return 0;
> > > > 
> > > 
> > > I understand how this may make your test case pass, but I simply don't 
> > > understand how this could possibly be the correct thing to do.  This would 
> > > cause oom_badness() to return 0 for any process where a thread has 
> > > TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
> > > kills a thread, and it is recalled before that thread may exit, then this 
> > > will panic the system if there are no other eligible processes to kill.
> > > 
> > Why? oom_badness() is called after oom_scan_process_thread() returned OOM_SCAN_OK.
> > oom_scan_process_thread() returns OOM_SCAN_ABORT if a thread has TIF_MEMDIE set.
> > 
> 
> oom_scan_process_thread() checks for TIF_MEMDIE on p, not on p's threads.
> If one of p's threads has TIF_MEMDIE set and p does not, we actually want 
> to set TIF_MEMDIE for p.  That's the current behavior since it will lead 
> to p->mm memory freeing.  Your patch is excluding such processes entirely 
> and selecting another process to kill unnecessarily.
> 

I think p's threads are checked by oom_scan_process_thread() for TIF_MEMDIE
even if p does not have TIF_MEMDIE. What am I misunderstanding about what
for_each_process_thread(g, p) is doing?

  #define for_each_process_thread(p, t) for_each_process(p) for_each_thread(p, t)

  select_bad_process() {
    for_each_process_thread(g, p) {
      oom_scan_process_thread(oc, p, totalpages));
      oom_badness(p);
    }
  }

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-14 10:26                   ` Tetsuo Handa
@ 2016-01-14 21:53                     ` David Rientjes
  -1 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2016-01-14 21:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Thu, 14 Jan 2016, Tetsuo Handa wrote:

> > > > > @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > > > >  	if (oom_unkillable_task(p, memcg, nodemask))
> > > > >  		return 0;
> > > > > 
> > > > > -	p = find_lock_task_mm(p);
> > > > > +	p = find_lock_non_victim_task_mm(p);
> > > > >  	if (!p)
> > > > >  		return 0;
> > > > > 
> > > > 
> > > > I understand how this may make your test case pass, but I simply don't 
> > > > understand how this could possibly be the correct thing to do.  This would 
> > > > cause oom_badness() to return 0 for any process where a thread has 
> > > > TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
> > > > kills a thread, and it is recalled before that thread may exit, then this 
> > > > will panic the system if there are no other eligible processes to kill.
> > > > 
> > > Why? oom_badness() is called after oom_scan_process_thread() returned OOM_SCAN_OK.
> > > oom_scan_process_thread() returns OOM_SCAN_ABORT if a thread has TIF_MEMDIE set.
> > > 
> > 
> > oom_scan_process_thread() checks for TIF_MEMDIE on p, not on p's threads.
> > If one of p's threads has TIF_MEMDIE set and p does not, we actually want 
> > to set TIF_MEMDIE for p.  That's the current behavior since it will lead 
> > to p->mm memory freeing.  Your patch is excluding such processes entirely 
> > and selecting another process to kill unnecessarily.
> > 
> 
> I think p's threads are checked by oom_scan_process_thread() for TIF_MEMDIE
> even if p does not have TIF_MEMDIE. What am I misunderstanding about what
> for_each_process_thread(g, p) is doing?
> 
>   #define for_each_process_thread(p, t) for_each_process(p) for_each_thread(p, t)
> 
>   select_bad_process() {
>     for_each_process_thread(g, p) {
>       oom_scan_process_thread(oc, p, totalpages));
>       oom_badness(p);
>     }
>   }
> 

Yes, select_bad_process() iterates over threads, that is obvious.  The 
point is that today it can select a thread independent of whether any of 
its other threads have TIF_MEMDIE set, which is the desired behavior per 
the above.  With your change, that is no longer possible because we 
disregard _all_ threads if one of them has TIF_MEMDIE set.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] mm,oom: Exclude TIF_MEMDIE processes from candidates.
  2016-01-14 21:53                     ` David Rientjes
@ 2016-01-14 22:21                       ` Tetsuo Handa
  -1 siblings, 0 replies; 44+ messages in thread
From: Tetsuo Handa @ 2016-01-14 22:21 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, hannes, akpm, mgorman, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

David Rientjes wrote:
> On Thu, 14 Jan 2016, Tetsuo Handa wrote:
> 
> > > > > > @@ -171,7 +195,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > > > > >  	if (oom_unkillable_task(p, memcg, nodemask))
> > > > > >  		return 0;
> > > > > > 
> > > > > > -	p = find_lock_task_mm(p);
> > > > > > +	p = find_lock_non_victim_task_mm(p);
> > > > > >  	if (!p)
> > > > > >  		return 0;
> > > > > > 
> > > > > 
> > > > > I understand how this may make your test case pass, but I simply don't 
> > > > > understand how this could possibly be the correct thing to do.  This would 
> > > > > cause oom_badness() to return 0 for any process where a thread has 
> > > > > TIF_MEMDIE set.  If the oom killer is called from the page allocator, 
> > > > > kills a thread, and it is recalled before that thread may exit, then this 
> > > > > will panic the system if there are no other eligible processes to kill.
> > > > > 
> > > > Why? oom_badness() is called after oom_scan_process_thread() returned OOM_SCAN_OK.
> > > > oom_scan_process_thread() returns OOM_SCAN_ABORT if a thread has TIF_MEMDIE set.
> > > > 
> > > 
> > > oom_scan_process_thread() checks for TIF_MEMDIE on p, not on p's threads.
> > > If one of p's threads has TIF_MEMDIE set and p does not, we actually want 
> > > to set TIF_MEMDIE for p.  That's the current behavior since it will lead 
> > > to p->mm memory freeing.  Your patch is excluding such processes entirely 
> > > and selecting another process to kill unnecessarily.
> > > 
> > 
> > I think p's threads are checked by oom_scan_process_thread() for TIF_MEMDIE
> > even if p does not have TIF_MEMDIE. What am I misunderstanding about what
> > for_each_process_thread(g, p) is doing?
> > 
> >   #define for_each_process_thread(p, t) for_each_process(p) for_each_thread(p, t)
> > 
> >   select_bad_process() {
> >     for_each_process_thread(g, p) {
> >       oom_scan_process_thread(oc, p, totalpages));
> >       oom_badness(p);
> >     }
> >   }
> > 
> 
> Yes, select_bad_process() iterates over threads, that is obvious.  The 
> point is that today it can select a thread independent of whether any of 
> its other threads have TIF_MEMDIE set, which is the desired behavior per 
> the above.  With your change, that is no longer possible because we 
> disregard _all_ threads if one of them has TIF_MEMDIE set.
> 

I still cannot understand. Today select_bad_process() can select a thread
independent of whether any of its other threads have TIF_MEMDIE set. But
select_bad_process() ultimately ignores the thread selected by the
oom_badness() logic and aborts the iteration loop as soon as
oom_scan_process_thread() finds a TIF_MEMDIE thread among all threads.
Changing the oom_badness() logic to skip processes with TIF_MEMDIE threads
therefore does not change the task that select_bad_process() finally returns.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2016-01-14 22:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-29 13:58 [PATCH] mm,oom: Exclude TIF_MEMDIE processes from candidates Tetsuo Handa
2015-12-29 13:58 ` Tetsuo Handa
2016-01-07  9:15 ` Michal Hocko
2016-01-07  9:15   ` Michal Hocko
2016-01-07 13:31   ` Tetsuo Handa
2016-01-07 13:31     ` Tetsuo Handa
2016-01-07 14:58     ` Michal Hocko
2016-01-07 14:58       ` Michal Hocko
2016-01-07 15:38       ` Tetsuo Handa
2016-01-07 15:38         ` Tetsuo Handa
2016-01-11 15:18         ` Michal Hocko
2016-01-11 15:18           ` Michal Hocko
2016-01-12 11:32           ` Tetsuo Handa
2016-01-12 11:32             ` Tetsuo Handa
2016-01-12 19:52             ` Michal Hocko
2016-01-12 19:52               ` Michal Hocko
2016-01-13 10:15               ` Tetsuo Handa
2016-01-13 10:15                 ` Tetsuo Handa
2016-01-13 15:24                 ` Michal Hocko
2016-01-13 15:24                   ` Michal Hocko
2016-01-07 15:44       ` Michal Hocko
2016-01-07 15:44         ` Michal Hocko
2016-01-08 10:09         ` [PATCH v2] " Tetsuo Handa
2016-01-08 10:09           ` Tetsuo Handa
2016-01-13  0:32           ` David Rientjes
2016-01-13  0:32             ` David Rientjes
2016-01-13 10:52             ` Tetsuo Handa
2016-01-13 10:52               ` Tetsuo Handa
2016-01-14  0:57               ` David Rientjes
2016-01-14  0:57                 ` David Rientjes
2016-01-14 10:26                 ` Tetsuo Handa
2016-01-14 10:26                   ` Tetsuo Handa
2016-01-14 21:53                   ` David Rientjes
2016-01-14 21:53                     ` David Rientjes
2016-01-14 22:21                     ` Tetsuo Handa
2016-01-14 22:21                       ` Tetsuo Handa
2016-01-07 16:28 ` [PATCH] " Johannes Weiner
2016-01-07 16:28   ` Johannes Weiner
2016-01-08 12:37   ` Michal Hocko
2016-01-08 12:37     ` Michal Hocko
2016-01-08 13:14     ` Tetsuo Handa
2016-01-08 13:14       ` Tetsuo Handa
2016-01-08 13:41       ` Michal Hocko
2016-01-08 13:41         ` Michal Hocko
