linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] preparation for merging the OOM reaper
@ 2016-02-17 10:28 Tetsuo Handa
  2016-02-17 10:29 ` [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates Tetsuo Handa
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:28 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

I am posting this patchset for smoothly merging the OOM reaper without
worrying about corner cases. This patchset also applies cleanly on top of
the current Linus tree because this patchset is meant for applying before
merging the OOM reaper.

Several problems were found in oom reaper v5 patchset
( http://lkml.kernel.org/r/1454505240-23446-1-git-send-email-mhocko@kernel.org ).

(1) "[PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap
    the address space" was added in order to allow the OOM killer select
    next OOM victim by marking current OOM victim OOM-unkillable after the
    OOM reaper reaped current OOM victim's memory. But it was found that
    threads created by clone(!CLONE_SIGHAND && CLONE_VM) disable further
    OOM reaping because next OOM victim is sharing current OOM victim's
    memory and only current OOM victim is marked OOM-unkillable. While it
    would be possible to mark all processes sharing current OOM victim's
    memory OOM-unkillable, we are not trying to traverse the process list.
    ( http://lkml.kernel.org/r/201602042322.IAG65142.MOOJHFSVLOQFFt@I-love.SAKURA.ne.jp )

(2) It was found that clearing TIF_MEMDIE does not allow the OOM killer
    select next OOM victim if current OOM victim got stuck between getting
    PF_EXITING and doing victim's mm = NULL. Since it is possible that we
    hit silent OOM livelock before we select current OOM victim, we should
    fix it before merging the OOM reaper.
    ( http://lkml.kernel.org/r/20160111165214.GA32132@cmpxchg.org )

(3) "[PATCH 5/5] mm, oom_reaper: implement OOM victims queuing" was added
    in order to allow more robust queuing in case multiple OOM events occur
    in short period. But it was found that this queuing approach did not
    take into account oom_kill_allocating_task = 1 case which does not wait
    for existing TIF_MEMDIE threads to clear TIF_MEMDIE. Since this is a bug
    of the OOM killer rather than a bug of the OOM reaper, we should fix it
    before merging the OOM reaper.
    ( http://lkml.kernel.org/r/201602162011.ECG52697.VOLJFtOQHFMSFO@I-love.SAKURA.ne.jp )

(4) There is a very strong collision between Michal Hocko and Tetsuo Handa.
    Michal wants to merge the OOM reaper as soon as possible and correct
    corner cases afterward because this is not an urgent problem. Tetsuo
    wants to stop lying as soon as possible by papering over corner cases
    for now because this is "either address now or too late to address"
    problem, for this problem can annoy customers and technical staffs at
    support center (Tetsuo was working there) for next 10 years unless
    this problem is fixed before the DEADLINE (the day customers decide
    the distributor's specific kernel version to use for their systems).
    If we don't want to merge a mechanism for warning silent OOM livelock
    problems (e.g. kmallocwd), Tetsuo really wants to get rid of all
    possible locations that can cause silent OOM livelock problems.

This patchset does the following things.

  "[PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates."
  allows SysRq-f to select !TIF_MEMDIE process which fixes a bug in
  current kernels.

  "[PATCH 2/6] mm,oom: don't abort on exiting processes when selecting
  a victim." fixes a problem (2) explained above.

  "[PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are
  OOM victims." fixes a problem similar to (2) explained above.

  "[PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are
  OOM-unkillable." fixes a problem similar to (2) explained above.

  "[PATCH 5/6] mm,oom: Re-enable OOM killer using timers." mitigates
  a problem (1) explained above, by allowing the kernel ignore existing
  TIF_MEMDIE threads when OOM livelock is detected by timer based
  heuristics.

  "[PATCH 6/6] mm,oom: wait for OOM victims when using
  oom_kill_allocating_task == 1" fixes a problem (3) explained above.

By applying this patchset prior to applying the OOM reaper patchset,
we can start the OOM reaper without correcting problems found in
"[PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap
the address space" and "[PATCH 5/5] mm, oom_reaper: implement OOM
victims queuing".

 oom_kill.c |   71 +++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
  2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
@ 2016-02-17 10:29 ` Tetsuo Handa
  2016-02-17 12:41   ` Michal Hocko
  2016-02-17 10:30 ` [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim Tetsuo Handa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:29 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

>From 142b08258e4c60834602e9b0a734564208bc6397 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 17 Feb 2016 16:29:29 +0900
Subject: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.

The OOM reaper kernel thread can reclaim OOM victim's memory before
the victim releases it. But it is possible that a TIF_MEMDIE thread
gets stuck at down_read(&mm->mmap_sem) in exit_mm() called from
do_exit() due to one of !TIF_MEMDIE threads doing a GFP_KERNEL
allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
(e.g. mmap()). In that case, we need to use SysRq-f (manual invocation
of the OOM killer) because down_read_trylock(&mm->mmap_sem) by the OOM
reaper will not succeed. Also, there are other situations where the OOM
reaper cannot reap the victim's memory (e.g. CONFIG_MMU=n, victim's
memory is shared with OOM-unkillable processes) which will require
manual SysRq-f for making progress.

However, it is possible that the OOM killer chooses the same OOM victim
forever which already has TIF_MEMDIE. This is effectively disabling
SysRq-f. This patch excludes processes which has a TIF_MEMDIE thread
 from OOM victim candidates.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 871470f..27949ef 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -119,6 +119,30 @@ found:
 }
 
 /*
+ * Treat the whole process p as unkillable when one of threads has
+ * TIF_MEMDIE pending. Otherwise, we may end up setting TIF_MEMDIE
+ * on the same victim forever (e.g. making SysRq-f unusable).
+ */
+static struct task_struct *find_lock_non_victim_task_mm(struct task_struct *p)
+{
+	struct task_struct *t;
+
+	rcu_read_lock();
+
+	for_each_thread(p, t) {
+		if (likely(!test_tsk_thread_flag(t, TIF_MEMDIE)))
+			continue;
+		t = NULL;
+		goto found;
+	}
+	t = find_lock_task_mm(p);
+ found:
+	rcu_read_unlock();
+
+	return t;
+}
+
+/*
  * order == -1 means the oom kill is required by sysrq, otherwise only
  * for display purposes.
  */
@@ -165,7 +189,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	if (oom_unkillable_task(p, memcg, nodemask))
 		return 0;
 
-	p = find_lock_task_mm(p);
+	p = find_lock_non_victim_task_mm(p);
 	if (!p)
 		return 0;
 
@@ -361,7 +385,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
 
-		task = find_lock_task_mm(p);
+		task = find_lock_non_victim_task_mm(p);
 		if (!task) {
 			/*
 			 * This is a kthread or all of p's threads have already
@@ -562,7 +586,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	read_unlock(&tasklist_lock);
 
-	p = find_lock_task_mm(victim);
+	p = find_lock_non_victim_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
 		return;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
  2016-02-17 10:29 ` [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates Tetsuo Handa
@ 2016-02-17 10:30 ` Tetsuo Handa
  2016-02-17 12:54   ` Michal Hocko
  2016-02-17 10:32 ` [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims Tetsuo Handa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:30 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

>From 22bd036766e70f0df38c38f3ecc226e857d20faf Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 17 Feb 2016 16:30:59 +0900
Subject: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.

Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there
is a thread which is exiting. But it is possible that that thread is
blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit()
whereas one of threads sharing that memory is doing a GFP_KERNEL
allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
(e.g. mmap()). Under such situation, the OOM killer does not choose a
victim, which results in silent OOM livelock problem.

This patch changes oom_scan_process_thread() not to return OOM_SCAN_ABORT
when there is a thread which is exiting.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 27949ef..a3868fd 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -311,9 +311,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
-		return OOM_SCAN_ABORT;
-
 	return OOM_SCAN_OK;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims.
  2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
  2016-02-17 10:29 ` [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates Tetsuo Handa
  2016-02-17 10:30 ` [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim Tetsuo Handa
@ 2016-02-17 10:32 ` Tetsuo Handa
  2016-02-17 13:02   ` Michal Hocko
  2016-02-17 10:33 ` [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable Tetsuo Handa
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:32 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

>From f5531e726caad7431020c027b6900a8e2c678345 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 17 Feb 2016 16:32:37 +0900
Subject: [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims.

Currently, oom_scan_process_thread() returns OOM_SCAN_SELECT when there
is a thread which returns oom_task_origin() == true. But it is possible
that that thread is sharing memory with OOM-unkillable processes or the
OOM reaper fails to reclaim enough memory. In that case, we must not
continue selecting such threads forever.

This patch changes oom_scan_process_thread() not to select a thread
which returns oom_task_origin() = true if TIF_MEMDIE is already set
because SysRq-f case can reach here. Since "mm,oom: exclude TIF_MEMDIE
processes from candidates." made sure that we will choose a !TIF_MEMDIE
thread when only some of threads are marked TIF_MEMDIE, we don't need to
check all threads which returns oom_task_origin() == true.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a3868fd..b0c327d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -308,7 +308,7 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
-	if (oom_task_origin(task))
+	if (oom_task_origin(task) && !test_tsk_thread_flag(task, TIF_MEMDIE))
 		return OOM_SCAN_SELECT;
 
 	return OOM_SCAN_OK;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.
  2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
                   ` (2 preceding siblings ...)
  2016-02-17 10:32 ` [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims Tetsuo Handa
@ 2016-02-17 10:33 ` Tetsuo Handa
  2016-02-17 13:10   ` Michal Hocko
  2016-02-17 10:34 ` [PATCH 5/6] mm,oom: Re-enable OOM killer using timers Tetsuo Handa
  2016-02-17 10:36 ` [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1 Tetsuo Handa
  5 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:33 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

>From 4924ca3031444bfb831b2d4f004e5a613ad48d68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 17 Feb 2016 16:35:12 +0900
Subject: [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.

oom_scan_process_thread() returns OOM_SCAN_SELECT when there is a
thread which returns oom_task_origin() == true. But it is possible
that that thread is marked as OOM-unkillable.

This patch changes oom_scan_process_thread() not to select it
if it is marked as OOM-unkillable.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b0c327d..ebc6764 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -308,7 +308,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
-	if (oom_task_origin(task) && !test_tsk_thread_flag(task, TIF_MEMDIE))
+	if (oom_task_origin(task) && !test_tsk_thread_flag(task, TIF_MEMDIE) &&
+	    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
 		return OOM_SCAN_SELECT;
 
 	return OOM_SCAN_OK;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.
  2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
                   ` (3 preceding siblings ...)
  2016-02-17 10:33 ` [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable Tetsuo Handa
@ 2016-02-17 10:34 ` Tetsuo Handa
  2016-02-17 13:20   ` Michal Hocko
  2016-02-17 10:36 ` [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1 Tetsuo Handa
  5 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:34 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

>From 6f07b71c97766ec111d26c3424bded465ca48195 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 17 Feb 2016 16:37:01 +0900
Subject: [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.

We are trying to reduce the possibility of hitting OOM livelock by
introducing the OOM reaper, but there are situations where the OOM reaper
cannot reap the victim's memory. We want to introduce the OOM reaper as
simple as possible and make the OOM reaper better via incremental
development.

This patch adds a timer for handling corner cases where a TIF_MEMDIE
thread got stuck by reasons not handled by the initial version of the
OOM reaper. Since "mm,oom: exclude TIF_MEMDIE processes from candidates."
made sure that we won't choose the same OOM victim forever and this patch
makes sure that the kernel automatically presses SysRq-f upon OOM stalls,
we will not OOM stall forever as long as the OOM killer is called.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ebc6764..fba2c62 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,11 @@ int sysctl_oom_dump_tasks = 1;
 
 DEFINE_MUTEX(oom_lock);
 
+static void oomkiller_reset(unsigned long arg)
+{
+}
+static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
+
 #ifdef CONFIG_NUMA
 /**
  * has_intersects_mems_allowed() - check task eligiblity for kill
@@ -299,7 +304,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
 		if (!is_sysrq_oom(oc))
-			return OOM_SCAN_ABORT;
+			return timer_pending(&oomkiller_victim_wait_timer) ?
+				OOM_SCAN_ABORT : OOM_SCAN_CONTINUE;
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
@@ -452,6 +458,8 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	/* Make sure that we won't wait for this task forever. */
+	mod_timer(&oomkiller_victim_wait_timer, jiffies + 5 * HZ);
 }
 
 /**
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
  2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
                   ` (4 preceding siblings ...)
  2016-02-17 10:34 ` [PATCH 5/6] mm,oom: Re-enable OOM killer using timers Tetsuo Handa
@ 2016-02-17 10:36 ` Tetsuo Handa
  2016-02-17 13:32   ` Michal Hocko
  5 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:36 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

>From 0b36864d4100ecbdcaa2fc2d1927c9e270f1b629 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 17 Feb 2016 16:37:59 +0900
Subject: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1

Currently, out_of_memory() does not wait for existing TIF_MEMDIE threads
if /proc/sys/vm/oom_kill_allocating_task is set to 1. This can result in
killing more OOM victims than needed. We can wait for the OOM reaper to
reap memory used by existing TIF_MEMDIE threads if possible. If the OOM
reaper is not available, the system will be kept OOM stalled until an
OOM-unkillable thread does a GFP_FS allocation request and calls
oom_kill_allocating_task == 0 path.

This patch changes oom_kill_allocating_task == 1 case to call
select_bad_process() in order to wait for existing TIF_MEMDIE threads.
Since "mm,oom: exclude TIF_MEMDIE processes from candidates.",
"mm,oom: don't abort on exiting processes when selecting a victim.",
"mm,oom: exclude oom_task_origin processes if they are OOM victims.",
"mm,oom: exclude oom_task_origin processes if they are OOM-unkillable."
and "mm,oom: Re-enable OOM killer using timers." made sure that we never
wait for TIF_MEMDIE threads forever, waiting for TIF_MEMDIE threads for
oom_kill_allocating_task == 1 does not cause OOM livelock problem.

After this patch, we can safely merge the OOM reaper in the simplest
form, without worrying about corner cases.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fba2c62..9cd1cd1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -737,15 +737,6 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint, NULL);
 
-	if (sysctl_oom_kill_allocating_task && current->mm &&
-	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
-				 "Out of memory (oom_kill_allocating_task)");
-		return true;
-	}
-
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
@@ -753,8 +744,18 @@ bool out_of_memory(struct oom_control *oc)
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (p && p != (void *)-1UL) {
-		oom_kill_process(oc, p, points, totalpages, NULL,
-				 "Out of memory");
+		const char *message = "Out of memory";
+
+		if (sysctl_oom_kill_allocating_task && current->mm &&
+		    !oom_unkillable_task(current, NULL, oc->nodemask) &&
+		    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+			put_task_struct(p);
+			p = current;
+			get_task_struct(p);
+			message = "Out of memory (oom_kill_allocating_task)";
+			points = 0;
+		}
+		oom_kill_process(oc, p, points, totalpages, NULL, message);
 		/*
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
  2016-02-17 10:29 ` [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates Tetsuo Handa
@ 2016-02-17 12:41   ` Michal Hocko
  2016-02-17 16:40     ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 12:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:29:33, Tetsuo Handa wrote:
> >From 142b08258e4c60834602e9b0a734564208bc6397 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 17 Feb 2016 16:29:29 +0900
> Subject: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
> 
> The OOM reaper kernel thread can reclaim OOM victim's memory before
> the victim releases it.

If this is aimed to be preparatory work, which I am not convinced about
to be honest, then referring to oom reaper is confusing and misleading.

> But it is possible that a TIF_MEMDIE thread
> gets stuck at down_read(&mm->mmap_sem) in exit_mm() called from
> do_exit() due to one of !TIF_MEMDIE threads doing a GFP_KERNEL
> allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> (e.g. mmap()). In that case, we need to use SysRq-f (manual invocation
> of the OOM killer) because down_read_trylock(&mm->mmap_sem) by the OOM
> reaper will not succeed.

But all the tasks sharing the mm with the oom victim will have
fatal_signal_pending and so they will get access to memory reserves and
that should help them to finish the allocation request. So the above
text is misleading.

If the down_read is blocked because down_write is blocked then a better
solution is to make down_write_killable which has been already proposed.

> Also, there are other situations where the OOM
> reaper cannot reap the victim's memory (e.g. CONFIG_MMU=n,

there was no clear evidence that this is a problem on !MMU
configurations.

> victim's memory is shared with OOM-unkillable processes) which will
> require manual SysRq-f for making progress.

Sharing mm with a task which is hidden from the OOM killer is a clear
misconfiguration IMO.
 
> However, it is possible that the OOM killer chooses the same OOM victim
> forever which already has TIF_MEMDIE.

This can happen only for the sysrq+f case AFAICS. Regular OOM killer
will stop scanning after it encounters the first TIF_MEMDIE task.
If you want to handle the sysrq+f case then it should be imho explicit.
Something I've tries here as patch 1/2
http://lkml.kernel.org/r/1452632425-20191-1-git-send-email-mhocko@kernel.org
which has been nacked. Maybe you can try again without
fatal_signal_pending resp. task_will_free_mem checks which were
controversial back then. Hiding this into find_lock_non_victim_task_mm
is just making the code more obscure and harder to read.

> This is effectively disabling
> SysRq-f. This patch excludes processes which has a TIF_MEMDIE thread
>  from OOM victim candidates.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

In short I dislike this patch. It makes the code harder to read and the
same can be solved more straightforward:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 078e07ec0906..68cc130c163b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -281,6 +281,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
+		else
+			return OOM_SCAN_CONTINUE;
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
@@ -719,6 +721,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 
 			if (process_shares_mm(child, p->mm))
 				continue;
+
+			if (is_sysrq_oom(oc) && test_tsk_thread_flag(child, TIF_MEMDIE))
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 10:30 ` [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim Tetsuo Handa
@ 2016-02-17 12:54   ` Michal Hocko
  2016-02-17 13:07     ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 12:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:30:41, Tetsuo Handa wrote:
> >From 22bd036766e70f0df38c38f3ecc226e857d20faf Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 17 Feb 2016 16:30:59 +0900
> Subject: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
> 
> Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there
> is a thread which is exiting. But it is possible that that thread is
> blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit()
> whereas one of threads sharing that memory is doing a GFP_KERNEL
> allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> (e.g. mmap()). Under such situation, the OOM killer does not choose a
> victim, which results in silent OOM livelock problem.

Again, such a thread/task will have fatal_signal_pending and so have
access to memory reserves. So the text is slightly misleading imho.
Sure if the memory reserves are depleted then we will not move on but
then it is not clear whether the current patch helps either.

> This patch changes oom_scan_process_thread() not to return OOM_SCAN_ABORT
> when there is a thread which is exiting.

The same patch has been poseted already [1] so at least Johannes' s-o-b
(and CC) would be appropriate. As I've already said I am not against
this change, especially after oom_reaper is merged and the patch
description reformulated.

[1] http://lkml.kernel.org/r/569433bb.diO0RgkTdhop9gmH%25akpm%40linux-foundation.org
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 27949ef..a3868fd 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -311,9 +311,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
> -		return OOM_SCAN_ABORT;
> -
>  	return OOM_SCAN_OK;
>  }
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims.
  2016-02-17 10:32 ` [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims Tetsuo Handa
@ 2016-02-17 13:02   ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 13:02 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:32:00, Tetsuo Handa wrote:
> >From f5531e726caad7431020c027b6900a8e2c678345 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 17 Feb 2016 16:32:37 +0900
> Subject: [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims.
> 
> Currently, oom_scan_process_thread() returns OOM_SCAN_SELECT when there
> is a thread which returns oom_task_origin() == true. But it is possible
> that that thread is sharing memory with OOM-unkillable processes or the
> OOM reaper fails to reclaim enough memory. In that case, we must not
> continue selecting such threads forever.
> 
> This patch changes oom_scan_process_thread() not to select a thread
> which returns oom_task_origin() = true if TIF_MEMDIE is already set
> because SysRq-f case can reach here. Since "mm,oom: exclude TIF_MEMDIE
> processes from candidates." made sure that we will choose a !TIF_MEMDIE
> thread when only some of threads are marked TIF_MEMDIE, we don't need to
> check all threads which returns oom_task_origin() == true.

I do not think this is necessary. If you simply do OOM_SCAN_CONTINUE for
TIF_MEMDIE && is_sysrq_oom then you should be covered AFAICS.

> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a3868fd..b0c327d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -308,7 +308,7 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
>  	 */
> -	if (oom_task_origin(task))
> +	if (oom_task_origin(task) && !test_tsk_thread_flag(task, TIF_MEMDIE))
>  		return OOM_SCAN_SELECT;
>  
>  	return OOM_SCAN_OK;
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 12:54   ` Michal Hocko
@ 2016-02-17 13:07     ` Tetsuo Handa
  2016-02-17 14:00       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 13:07 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 19:30:41, Tetsuo Handa wrote:
> > >From 22bd036766e70f0df38c38f3ecc226e857d20faf Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Wed, 17 Feb 2016 16:30:59 +0900
> > Subject: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
> > 
> > Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there
> > is a thread which is exiting. But it is possible that that thread is
> > blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit()
> > whereas one of threads sharing that memory is doing a GFP_KERNEL
> > allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> > (e.g. mmap()). Under such situation, the OOM killer does not choose a
> > victim, which results in silent OOM livelock problem.
> 
> Again, such a thread/task will have fatal_signal_pending and so have
> access to memory reserves. So the text is slightly misleading imho.
> Sure if the memory reserves are depleted then we will not move on but
> then it is not clear whether the current patch helps either.

I don't think so.
Please see http://lkml.kernel.org/r/201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp .
There is a race window before such a thread/task receives SIGKILL.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.
  2016-02-17 10:33 ` [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable Tetsuo Handa
@ 2016-02-17 13:10   ` Michal Hocko
  2016-02-17 13:36     ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 13:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:33:07, Tetsuo Handa wrote:
> >From 4924ca3031444bfb831b2d4f004e5a613ad48d68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 17 Feb 2016 16:35:12 +0900
> Subject: [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.
> 
> oom_scan_process_thread() returns OOM_SCAN_SELECT when there is a
> thread which returns oom_task_origin() == true. But it is possible
> that that thread is marked as OOM-unkillable.
> 
> This patch changes oom_scan_process_thread() not to select it
> if it is marked as OOM-unkillable.

oom_task_origin is only swapoff and ksm_store right now. I seriously
doubt anybody sane will run them as OOM disabled (directly or
indirectly).

But you have a point that returing anything but OOM_SCAN_CONTINUE for
OOM_SCORE_ADJ_MIN from oom_scan_process_thread sounds suboptimal.
Sure such a check would be racy but do we actually care about a OOM vs.
oom_score_adj_write. I am dubious to say the least.

So wouldn't it make more sense to check for OOM_SCORE_ADJ_MIN at the
very top of oom_scan_process_thread instead?
 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b0c327d..ebc6764 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -308,7 +308,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
>  	 */
> -	if (oom_task_origin(task) && !test_tsk_thread_flag(task, TIF_MEMDIE))
> +	if (oom_task_origin(task) && !test_tsk_thread_flag(task, TIF_MEMDIE) &&
> +	    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
>  		return OOM_SCAN_SELECT;
>  
>  	return OOM_SCAN_OK;
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.
  2016-02-17 10:34 ` [PATCH 5/6] mm,oom: Re-enable OOM killer using timers Tetsuo Handa
@ 2016-02-17 13:20   ` Michal Hocko
  2016-04-09 14:00     ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 13:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:34:46, Tetsuo Handa wrote:
> >From 6f07b71c97766ec111d26c3424bded465ca48195 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 17 Feb 2016 16:37:01 +0900
> Subject: [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.
> 
> We are trying to reduce the possibility of hitting OOM livelock by
> introducing the OOM reaper, but there are situations where the OOM reaper
> cannot reap the victim's memory. We want to introduce the OOM reaper as
> simple as possible and make the OOM reaper better via incremental
> development.
> 
> This patch adds a timer for handling corner cases where a TIF_MEMDIE
> thread got stuck by reasons not handled by the initial version of the
> OOM reaper. Since "mm,oom: exclude TIF_MEMDIE processes from candidates."
> made sure that we won't choose the same OOM victim forever and this patch
> makes sure that the kernel automatically presses SysRq-f upon OOM stalls,
> we will not OOM stall forever as long as the OOM killer is called.

Can we actually start by incremental changes first and only get to this
after we cannot find a proper way to fix existing issues?

I would like to at least make mmap_sem taken for write killable
first. This should allow the oom_reaper to make a forward progress and
allow the OOM killer to select another task when necessary (e.g. the
victim wasn't sitting on a large amount of reclaimable memory). This
has an advantage that the TIF_MEMDIE release is bound to a clearly
defined action rather than a $RANDOM timemout which will always be hard
to justify. We can talk about timeout based solutions after we are able
to livelock the system even after all well defined actions will have
failed. I really consider it premature right now.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ebc6764..fba2c62 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -45,6 +45,11 @@ int sysctl_oom_dump_tasks = 1;
>  
>  DEFINE_MUTEX(oom_lock);
>  
> +static void oomkiller_reset(unsigned long arg)
> +{
> +}
> +static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
> +
>  #ifdef CONFIG_NUMA
>  /**
>   * has_intersects_mems_allowed() - check task eligiblity for kill
> @@ -299,7 +304,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	 */
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
>  		if (!is_sysrq_oom(oc))
> -			return OOM_SCAN_ABORT;
> +			return timer_pending(&oomkiller_victim_wait_timer) ?
> +				OOM_SCAN_ABORT : OOM_SCAN_CONTINUE;
>  	}
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
> @@ -452,6 +458,8 @@ void mark_oom_victim(struct task_struct *tsk)
>  	 */
>  	__thaw_task(tsk);
>  	atomic_inc(&oom_victims);
> +	/* Make sure that we won't wait for this task forever. */
> +	mod_timer(&oomkiller_victim_wait_timer, jiffies + 5 * HZ);
>  }
>  
>  /**
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
  2016-02-17 10:36 ` [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1 Tetsuo Handa
@ 2016-02-17 13:32   ` Michal Hocko
  2016-02-18 10:45     ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:36:36, Tetsuo Handa wrote:
> >From 0b36864d4100ecbdcaa2fc2d1927c9e270f1b629 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 17 Feb 2016 16:37:59 +0900
> Subject: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
> 
> Currently, out_of_memory() does not wait for existing TIF_MEMDIE threads
> if /proc/sys/vm/oom_kill_allocating_task is set to 1. This can result in
> killing more OOM victims than needed. We can wait for the OOM reaper to
> reap memory used by existing TIF_MEMDIE threads if possible. If the OOM
> reaper is not available, the system will be kept OOM stalled until an
> OOM-unkillable thread does a GFP_FS allocation request and calls
> oom_kill_allocating_task == 0 path.
> 
> This patch changes oom_kill_allocating_task == 1 case to call
> select_bad_process() in order to wait for existing TIF_MEMDIE threads.

The primary motivation for oom_kill_allocating_task was to reduce the
overhead of select_bad_process. See fe071d7e8aae ("oom: add
oom_kill_allocating_task sysctl"). So this basically defeats the whole
purpose of the feature.

I am not user of this knob because it behaves absolutely randomly but
IMHO we should simply do something like the following. It would be more
compliant to the documentation and prevent from livelock which is
currently possible (albeit very unlikely) when a single task consimes
all the memory reserves and we keep looping over out_of_memory without
any progress.

But as I've said I have no idea whether somebody relies on the current
behavior so this is more of a thinking loudly than proposing an actual
patch at this point of time.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 078e07ec0906..7de84fb2dd03 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -706,6 +706,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
+	if (sysctl_oom_kill_allocating_task)
+		goto kill;
+
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
@@ -734,6 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	read_unlock(&tasklist_lock);
 
+kill:
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -888,6 +892,9 @@ bool out_of_memory(struct oom_control *oc)
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+		if (test_thread_flag(TIF_MEMDIE))
+			panic("Out of memory (oom_kill_allocating_task) not able to make a forward progress");
+
 		get_task_struct(current);
 		oom_kill_process(oc, current, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.
  2016-02-17 13:10   ` Michal Hocko
@ 2016-02-17 13:36     ` Tetsuo Handa
  2016-02-17 13:44       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 13:36 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 19:33:07, Tetsuo Handa wrote:
> > >From 4924ca3031444bfb831b2d4f004e5a613ad48d68 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Wed, 17 Feb 2016 16:35:12 +0900
> > Subject: [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.
> > 
> > oom_scan_process_thread() returns OOM_SCAN_SELECT when there is a
> > thread which returns oom_task_origin() == true. But it is possible
> > that that thread is marked as OOM-unkillable.
> > 
> > This patch changes oom_scan_process_thread() not to select it
> > if it is marked as OOM-unkillable.
> 
> oom_task_origin is only swapoff and ksm_store right now. I seriously
> doubt anybody sane will run them as OOM disabled (directly or
> indirectly).

I think that the OOM reaper will update such task as OOM-unkillable
after reaping that task's memory. This patch is intended for not to
fall into infinite loop after the OOM reaper updated it.

> 
> But you have a point that returing anything but OOM_SCAN_CONTINUE for
> OOM_SCORE_ADJ_MIN from oom_scan_process_thread sounds suboptimal.
> Sure such a check would be racy but do we actually care about a OOM vs.
> oom_score_adj_write. I am dubious to say the least.
> 
> So wouldn't it make more sense to check for OOM_SCORE_ADJ_MIN at the
> very top of oom_scan_process_thread instead?

Are you suggesting something like below?
(OOM_SCORE_ADJ_MIN check needs to be done after TIF_MEMDIE check)

enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
					struct task_struct *task, unsigned long totalpages)
{
	if (oom_unkillable_task(task, NULL, oc->nodemask))
		return OOM_SCAN_CONTINUE;

	/*
	 * This task already has access to memory reserves and is being killed.
	 * Don't allow any other task to have access to the reserves.
	 */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (!is_sysrq_oom(oc))
			return OOM_SCAN_ABORT;
	}
	if (!task->mm || task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return OOM_SCAN_CONTINUE;
(...snipped...)
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable.
  2016-02-17 13:36     ` Tetsuo Handa
@ 2016-02-17 13:44       ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 13:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 22:36:47, Tetsuo Handa wrote:
[...]
> Are you suggesting something like below?
> (OOM_SCORE_ADJ_MIN check needs to be done after TIF_MEMDIE check)
> 
> enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> 					struct task_struct *task, unsigned long totalpages)
> {
> 	if (oom_unkillable_task(task, NULL, oc->nodemask))
> 		return OOM_SCAN_CONTINUE;
> 
> 	/*
> 	 * This task already has access to memory reserves and is being killed.
> 	 * Don't allow any other task to have access to the reserves.
> 	 */
> 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> 		if (!is_sysrq_oom(oc))
> 			return OOM_SCAN_ABORT;
> 	}
> 	if (!task->mm || task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> 		return OOM_SCAN_CONTINUE;

yes

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 13:07     ` Tetsuo Handa
@ 2016-02-17 14:00       ` Michal Hocko
  2016-02-17 14:39         ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 14:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 22:07:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-02-16 19:30:41, Tetsuo Handa wrote:
> > > >From 22bd036766e70f0df38c38f3ecc226e857d20faf Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Wed, 17 Feb 2016 16:30:59 +0900
> > > Subject: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
> > > 
> > > Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there
> > > is a thread which is exiting. But it is possible that that thread is
> > > blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit()
> > > whereas one of threads sharing that memory is doing a GFP_KERNEL
> > > allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> > > (e.g. mmap()). Under such situation, the OOM killer does not choose a
> > > victim, which results in silent OOM livelock problem.
> > 
> > Again, such a thread/task will have fatal_signal_pending and so have
> > access to memory reserves. So the text is slightly misleading imho.
> > Sure if the memory reserves are depleted then we will not move on but
> > then it is not clear whether the current patch helps either.
> 
> I don't think so.
> Please see http://lkml.kernel.org/r/201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp .

I have missed this one. Reading...

Hmm, so you are not referring to OOM killed task but naturally exiting
thread which is racing with the OOM killer. I guess you have a point
there! Could you update the changelog with the above example and repost
please?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 14:00       ` Michal Hocko
@ 2016-02-17 14:39         ` Tetsuo Handa
  2016-02-17 15:01           ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 14:39 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 22:07:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 17-02-16 19:30:41, Tetsuo Handa wrote:
> > > > >From 22bd036766e70f0df38c38f3ecc226e857d20faf Mon Sep 17 00:00:00 2001
> > > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Date: Wed, 17 Feb 2016 16:30:59 +0900
> > > > Subject: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
> > > > 
> > > > Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there
> > > > is a thread which is exiting. But it is possible that that thread is
> > > > blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit()
> > > > whereas one of threads sharing that memory is doing a GFP_KERNEL
> > > > allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> > > > (e.g. mmap()). Under such situation, the OOM killer does not choose a
> > > > victim, which results in silent OOM livelock problem.
> > > 
> > > Again, such a thread/task will have fatal_signal_pending and so have
> > > access to memory reserves. So the text is slightly misleading imho.
> > > Sure if the memory reserves are depleted then we will not move on but
> > > then it is not clear whether the current patch helps either.
> > 
> > I don't think so.
> > Please see http://lkml.kernel.org/r/201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp .
> 
> I have missed this one. Reading...
> 
> Hmm, so you are not referring to OOM killed task but naturally exiting
> thread which is racing with the OOM killer. I guess you have a point
> there! Could you update the changelog with the above example and repost
> please?
> 
Yes and I resent that patch as v2.

I think that the same problem exists for any task_will_free_mem()-based
optimizations. Can we eliminate them because these optimized paths are not
handled by the OOM reaper which means that we have no means other than
"[PATCH 5/6] mm,oom: Re-enable OOM killer using timers." ?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 14:39         ` Tetsuo Handa
@ 2016-02-17 15:01           ` Michal Hocko
  2016-02-17 15:29             ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 15:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 23:39:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-02-16 22:07:31, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Wed 17-02-16 19:30:41, Tetsuo Handa wrote:
> > > > > >From 22bd036766e70f0df38c38f3ecc226e857d20faf Mon Sep 17 00:00:00 2001
> > > > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > > Date: Wed, 17 Feb 2016 16:30:59 +0900
> > > > > Subject: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
> > > > > 
> > > > > Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there
> > > > > is a thread which is exiting. But it is possible that that thread is
> > > > > blocked at down_read(&mm->mmap_sem) in exit_mm() called from do_exit()
> > > > > whereas one of threads sharing that memory is doing a GFP_KERNEL
> > > > > allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> > > > > (e.g. mmap()). Under such situation, the OOM killer does not choose a
> > > > > victim, which results in silent OOM livelock problem.
> > > > 
> > > > Again, such a thread/task will have fatal_signal_pending and so have
> > > > access to memory reserves. So the text is slightly misleading imho.
> > > > Sure if the memory reserves are depleted then we will not move on but
> > > > then it is not clear whether the current patch helps either.
> > > 
> > > I don't think so.
> > > Please see http://lkml.kernel.org/r/201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp .
> > 
> > I have missed this one. Reading...
> > 
> > Hmm, so you are not referring to OOM killed task but naturally exiting
> > thread which is racing with the OOM killer. I guess you have a point
> > there! Could you update the changelog with the above example and repost
> > please?
> > 
> Yes and I resent that patch as v2.
> 
> I think that the same problem exists for any task_will_free_mem()-based
> optimizations. Can we eliminate them because these optimized paths are not
> handled by the OOM reaper which means that we have no means other than
> "[PATCH 5/6] mm,oom: Re-enable OOM killer using timers." ?

Well, only oom_kill_process usage of task_will_free_mem might be a
problem because out_of_memory operates on the current task so it must be
in the allocation path and access to memory reserves should help it to
continue.
Wrt. oom_kill_process this will be more tricky. I guess we want to
teach oom_reaper to operate on such a task which would be a more robust
solution than removing the check altogether.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 15:01           ` Michal Hocko
@ 2016-02-17 15:29             ` Tetsuo Handa
  2016-02-17 16:17               ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 15:29 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > > > Please see http://lkml.kernel.org/r/201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp .
> > > 
> > > I have missed this one. Reading...
> > > 
> > > Hmm, so you are not referring to OOM killed task but naturally exiting
> > > thread which is racing with the OOM killer. I guess you have a point
> > > there! Could you update the changelog with the above example and repost
> > > please?
> > > 
> > Yes and I resent that patch as v2.
> > 
> > I think that the same problem exists for any task_will_free_mem()-based
> > optimizations. Can we eliminate them because these optimized paths are not
> > handled by the OOM reaper which means that we have no means other than
> > "[PATCH 5/6] mm,oom: Re-enable OOM killer using timers." ?
> 
> Well, only oom_kill_process usage of task_will_free_mem might be a
> problem because out_of_memory operates on the current task so it must be
> in the allocation path and access to memory reserves should help it to
> continue.

Allowing access to memory reserves by task_will_free_mem(current) in
out_of_memory() will help current to continue, but that does not guarantee
that current will not be later blocked at down_read(&current->mm->mmap_sem).
It is possible that one of threads sharing current thread's memory is calling
out_of_memory() from mmap() and is waiting for current to set
current->mm = NULL.

> Wrt. oom_kill_process this will be more tricky. I guess we want to
> teach oom_reaper to operate on such a task which would be a more robust
> solution than removing the check altogether.
> 

Thus, I think there is no difference between task_will_free_mem(current)
case and task_will_free_mem(p) case. We want to teach the OOM reaper to
operate whenever TIF_MEMDIE is set. But this means that we want
mm_is_reapable() check because there might be !SIGKILL && !PF_EXITING
threads when we run these optimized paths. We will need to use timer
if mm_is_reapable() == false after all.

Why don't you accept timer based workaround now, even if you have a plan
to update the OOM reaper for handling these optimized paths?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 15:29             ` Tetsuo Handa
@ 2016-02-17 16:17               ` Michal Hocko
  2016-02-18 11:21                 ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 16:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 18-02-16 00:29:35, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > > Please see http://lkml.kernel.org/r/201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp .
> > > > 
> > > > I have missed this one. Reading...
> > > > 
> > > > Hmm, so you are not referring to OOM killed task but naturally exiting
> > > > thread which is racing with the OOM killer. I guess you have a point
> > > > there! Could you update the changelog with the above example and repost
> > > > please?
> > > > 
> > > Yes and I resent that patch as v2.
> > > 
> > > I think that the same problem exists for any task_will_free_mem()-based
> > > optimizations. Can we eliminate them because these optimized paths are not
> > > handled by the OOM reaper which means that we have no means other than
> > > "[PATCH 5/6] mm,oom: Re-enable OOM killer using timers." ?
> > 
> > Well, only oom_kill_process usage of task_will_free_mem might be a
> > problem because out_of_memory operates on the current task so it must be
> > in the allocation path and access to memory reserves should help it to
> > continue.
> 
> Allowing access to memory reserves by task_will_free_mem(current) in
> out_of_memory() will help current to continue, but that does not guarantee
> that current will not be later blocked at down_read(&current->mm->mmap_sem).
> It is possible that one of threads sharing current thread's memory is calling
> out_of_memory() from mmap() and is waiting for current to set
> current->mm = NULL.
>
> > Wrt. oom_kill_process this will be more tricky. I guess we want to
> > teach oom_reaper to operate on such a task which would be a more robust
> > solution than removing the check altogether.
> > 
> 
> Thus, I think there is no difference between task_will_free_mem(current)
> case and task_will_free_mem(p) case.

Yes you are right! I completely managed to confuse and misled myself.

> We want to teach the OOM reaper to
> operate whenever TIF_MEMDIE is set. But this means that we want
> mm_is_reapable() check because there might be !SIGKILL && !PF_EXITING
> threads when we run these optimized paths.

> We will need to use timer if mm_is_reapable() == false after all.

Or we should re-evaluate those heuristics for multithreaded processes.
Does it even make sense to shortcut and block the OOM killer if the
single thread is exiting? Only very small amount of memory gets released
during its exit anyway. Don't we want to catch only the group exit to
catch fatal_signal_pending -> exit_signals -> exit_mm -> allocation
cases? I am not really sure what to check for, to be honest though.

> Why don't you accept timer based workaround now, even if you have a plan
> to update the OOM reaper for handling these optimized paths?

Because I believe that the timeout based solutions are distracting from
a proper solution which would be based on actual algorithm/heurstic that
can be measured and evaluated. And because I can see future discussion
of whether $FOO or $BAR is a better timeout... I really do not see any
reason to rush into quick solutions now.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
  2016-02-17 12:41   ` Michal Hocko
@ 2016-02-17 16:40     ` Tetsuo Handa
  2016-02-17 17:33       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 16:40 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 19:29:33, Tetsuo Handa wrote:
> > >From 142b08258e4c60834602e9b0a734564208bc6397 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Wed, 17 Feb 2016 16:29:29 +0900
> > Subject: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
> > 
> > The OOM reaper kernel thread can reclaim OOM victim's memory before
> > the victim releases it.
> 
> If this is aimed to be preparatory work, which I am not convinced about
> to be honest, then referring to oom reaper is confusing and misleading.
> 

OK. I removed it.

> > But it is possible that a TIF_MEMDIE thread
> > gets stuck at down_read(&mm->mmap_sem) in exit_mm() called from
> > do_exit() due to one of !TIF_MEMDIE threads doing a GFP_KERNEL
> > allocation between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
> > (e.g. mmap()). In that case, we need to use SysRq-f (manual invocation
> > of the OOM killer) because down_read_trylock(&mm->mmap_sem) by the OOM
> > reaper will not succeed.
> 
> But all the tasks sharing the mm with the oom victim will have
> fatal_signal_pending and so they will get access to memory reserves and
> that should help them to finish the allocation request. So the above
> text is misleading.
> 

Not true like explained in "[PATCH v2] mm,oom: don't abort on exiting
processes when selecting a victim.".

> If the down_read is blocked because down_write is blocked then a better
> solution is to make down_write_killable which has been already proposed.
> 
> > Also, there are other situations where the OOM
> > reaper cannot reap the victim's memory (e.g. CONFIG_MMU=n,
> 
> there was no clear evidence that this is a problem on !MMU
> configurations.
> 
> > victim's memory is shared with OOM-unkillable processes) which will
> > require manual SysRq-f for making progress.
> 
> Sharing mm with a task which is hidden from the OOM killer is a clear
> misconfiguration IMO.
>  

Misconfiguration and/or insane stress is no excuse to leave bugs unfixed.

> > However, it is possible that the OOM killer chooses the same OOM victim
> > forever which already has TIF_MEMDIE.
> 
> This can happen only for the sysrq+f case AFAICS. Regular OOM killer
> will stop scanning after it encounters the first TIF_MEMDIE task.
> If you want to handle the sysrq+f case then it should be imho explicit.
> Something I've tries here as patch 1/2
> http://lkml.kernel.org/r/1452632425-20191-1-git-send-email-mhocko@kernel.org
> which has been nacked. Maybe you can try again without
> fatal_signal_pending resp. task_will_free_mem checks which were
> controversial back then. Hiding this into find_lock_non_victim_task_mm
> is just making the code more obscure and harder to read.
> 
> > This is effectively disabling
> > SysRq-f. This patch excludes processes which has a TIF_MEMDIE thread
> >  from OOM victim candidates.
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> In short I dislike this patch. It makes the code harder to read and the
> same can be solved more straightforward:

Your patch is not doing the same thing. test_tsk_thread_flag() needs to be
checked against all threads as with process_shares_mm(). Otherwise,
find_lock_task_mm() can select a TIF_MEMDIE thread.

Updated patch follows.

> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 078e07ec0906..68cc130c163b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -281,6 +281,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
>  		if (!is_sysrq_oom(oc))
>  			return OOM_SCAN_ABORT;
> +		else
> +			return OOM_SCAN_CONTINUE;
>  	}
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
> @@ -719,6 +721,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  
>  			if (process_shares_mm(child, p->mm))
>  				continue;
> +
> +			if (is_sysrq_oom(oc) && test_tsk_thread_flag(child, TIF_MEMDIE))
> +				continue;
>  			/*
>  			 * oom_badness() returns 0 if the thread is unkillable
>  			 */
> -- 
> Michal Hocko
> SUSE Labs
> 
----------
>From 4d305f92e2527b6d86cd366952d598f9e95f095b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 18 Feb 2016 01:16:54 +0900
Subject: [PATCH v2] mm,oom: exclude TIF_MEMDIE processes from candidates.

It is possible that a TIF_MEMDIE thread gets stuck at
down_read(&mm->mmap_sem) in exit_mm() called from do_exit() due to
one of !TIF_MEMDIE threads doing a GFP_KERNEL allocation between
down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem) (e.g. mmap()).
In that case, we need to use SysRq-f (manual invocation of the OOM
killer) for making progress.

However, it is possible that the OOM killer chooses the same OOM victim
forever which already has TIF_MEMDIE. This is effectively disabling
SysRq-f. This patch excludes processes which has a TIF_MEMDIE thread
>from OOM victim candidates.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e6abaf..f6f6b47 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,21 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+/*
+ * To determine whether a task is an OOM victim, we examine all the task's
+ * threads: if one of those has TIF_MEMDIE then the task is an OOM victim.
+ */
+static bool is_oom_victim(struct task_struct *p)
+{
+	struct task_struct *t;
+
+	for_each_thread(p, t) {
+		if (test_tsk_thread_flag(t, TIF_MEMDIE))
+			return true;
+	}
+	return false;
+}
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
@@ -278,9 +293,11 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+	if (is_oom_victim(task)) {
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
+		else
+			return OOM_SCAN_CONTINUE;
 	}
 	if (!task->mm || task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		return OOM_SCAN_CONTINUE;
@@ -711,6 +728,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 
 			if (process_shares_mm(child, p->mm))
 				continue;
+			if (is_oom_victim(child))
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
  2016-02-17 16:40     ` Tetsuo Handa
@ 2016-02-17 17:33       ` Michal Hocko
  2016-02-17 20:55         ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-02-17 17:33 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 18-02-16 01:40:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-02-16 19:29:33, Tetsuo Handa wrote:
[...]
> > > victim's memory is shared with OOM-unkillable processes) which will
> > > require manual SysRq-f for making progress.
> > 
> > Sharing mm with a task which is hidden from the OOM killer is a clear
> > misconfiguration IMO.
> >  
> 
> Misconfiguration and/or insane stress is no excuse to leave bugs unfixed.

Such a misconfiguration requires administrator privileges and we do not
do not try really hard to prevent admins from shooting themselves into
foot. Especially if that makes the code much more complicated.
 
[...]
> > In short I dislike this patch. It makes the code harder to read and the
> > same can be solved more straightforward:
> 
> Your patch is not doing the same thing. test_tsk_thread_flag() needs to be
> checked against all threads as with process_shares_mm(). Otherwise,
> find_lock_task_mm() can select a TIF_MEMDIE thread.
> 
> Updated patch follows.
[...]
> >From 4d305f92e2527b6d86cd366952d598f9e95f095b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 18 Feb 2016 01:16:54 +0900
> Subject: [PATCH v2] mm,oom: exclude TIF_MEMDIE processes from candidates.
> 
> It is possible that a TIF_MEMDIE thread gets stuck at
> down_read(&mm->mmap_sem) in exit_mm() called from do_exit() due to
> one of !TIF_MEMDIE threads doing a GFP_KERNEL allocation between
> down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem) (e.g. mmap()).
> In that case, we need to use SysRq-f (manual invocation of the OOM
> killer) for making progress.
> 
> However, it is possible that the OOM killer chooses the same OOM victim
> forever which already has TIF_MEMDIE. This is effectively disabling
> SysRq-f. This patch excludes processes which has a TIF_MEMDIE thread
> >from OOM victim candidates.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6e6abaf..f6f6b47 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -268,6 +268,21 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
>  }
>  #endif
>  
> +/*
> + * To determine whether a task is an OOM victim, we examine all the task's
> + * threads: if one of those has TIF_MEMDIE then the task is an OOM victim.
> + */
> +static bool is_oom_victim(struct task_struct *p)
> +{
> +	struct task_struct *t;
> +
> +	for_each_thread(p, t) {
> +		if (test_tsk_thread_flag(t, TIF_MEMDIE))
> +			return true;
> +	}
> +	return false;
> +}
> +
>  enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  			struct task_struct *task, unsigned long totalpages)
>  {
> @@ -278,9 +293,11 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> +	if (is_oom_victim(task)) {

This will make the scanning much more time consuming (you will check
all the threads in the same thread group for each scanned thread!). I
do not think this is acceptable and it is not really needed for the
!is_sysrq_oom because we are scanning all the threads anyway.

Regarding the is_sysrq_oom case we might indeed select a thread
which doesn't have TIF_MEMDIE but it has been already (group) killed
but an attempt to catch that case is exactly what has been Nacked
previously when I tried to achieve the same thing and had TIF_MEMDIE ||
fatal_signal_pending check
(http://lkml.kernel.org/r/alpine.DEB.2.10.1601121639450.28831@chino.kir.corp.google.com).
This change will basically achieve the same (just in much more expansive
way) so I am not sure it overcomes the previous feedback.

>  		if (!is_sysrq_oom(oc))
>  			return OOM_SCAN_ABORT;
> +		else
> +			return OOM_SCAN_CONTINUE;
>  	}
>  	if (!task->mm || task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
>  		return OOM_SCAN_CONTINUE;
> @@ -711,6 +728,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  
>  			if (process_shares_mm(child, p->mm))
>  				continue;
> +			if (is_oom_victim(child))
> +				continue;
>  			/*
>  			 * oom_badness() returns 0 if the thread is unkillable
>  			 */
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates.
  2016-02-17 17:33       ` Michal Hocko
@ 2016-02-17 20:55         ` Tetsuo Handa
  0 siblings, 0 replies; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-17 20:55 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > From 4d305f92e2527b6d86cd366952d598f9e95f095b Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Thu, 18 Feb 2016 01:16:54 +0900
> > Subject: [PATCH v2] mm,oom: exclude TIF_MEMDIE processes from candidates.
> > 
> > It is possible that a TIF_MEMDIE thread gets stuck at
> > down_read(&mm->mmap_sem) in exit_mm() called from do_exit() due to
> > one of !TIF_MEMDIE threads doing a GFP_KERNEL allocation between
> > down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem) (e.g. mmap()).
> > In that case, we need to use SysRq-f (manual invocation of the OOM
> > killer) for making progress.
> > 
> > However, it is possible that the OOM killer chooses the same OOM victim
> > forever which already has TIF_MEMDIE. This is effectively disabling
> > SysRq-f. This patch excludes processes which has a TIF_MEMDIE thread
> > from OOM victim candidates.
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  mm/oom_kill.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 6e6abaf..f6f6b47 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -268,6 +268,21 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
> >  }
> >  #endif
> >  
> > +/*
> > + * To determine whether a task is an OOM victim, we examine all the task's
> > + * threads: if one of those has TIF_MEMDIE then the task is an OOM victim.
> > + */
> > +static bool is_oom_victim(struct task_struct *p)
> > +{
> > +	struct task_struct *t;
> > +
> > +	for_each_thread(p, t) {
> > +		if (test_tsk_thread_flag(t, TIF_MEMDIE))
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> >  enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  			struct task_struct *task, unsigned long totalpages)
> >  {
> > @@ -278,9 +293,11 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  	 * This task already has access to memory reserves and is being killed.
> >  	 * Don't allow any other task to have access to the reserves.
> >  	 */
> > -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> > +	if (is_oom_victim(task)) {
> 
> This will make the scanning much more time consuming (you will check
> all the threads in the same thread group for each scanned thread!). I
> do not think this is acceptable and it is not really needed for the
> !is_sysrq_oom because we are scanning all the threads anyway.
> 

Yes, I know. What looks complicating to me is that select_bad_process()
uses for_each_process_thread() when has_intersects_mems_allowed() and
task_in_mem_cgroup() are using for_each_thread().

Can't we change select_bad_process() to use for_each_process() ?
What are cases where for_each_process_thread() makes difference from
for_each_process() ?

> Regarding the is_sysrq_oom case we might indeed select a thread
> which doesn't have TIF_MEMDIE but it has been already (group) killed
> but an attempt to catch that case is exactly what has been Nacked
> previously when I tried to achieve the same thing and had TIF_MEMDIE ||
> fatal_signal_pending check
> (http://lkml.kernel.org/r/alpine.DEB.2.10.1601121639450.28831@chino.kir.corp.google.com).
> This change will basically achieve the same (just in much more expansive
> way) so I am not sure it overcomes the previous feedback.
> 
> >  		if (!is_sysrq_oom(oc))
> >  			return OOM_SCAN_ABORT;
> > +		else
> > +			return OOM_SCAN_CONTINUE;
> >  	}
> >  	if (!task->mm || task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> >  		return OOM_SCAN_CONTINUE;
> > @@ -711,6 +728,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  
> >  			if (process_shares_mm(child, p->mm))
> >  				continue;
> > +			if (is_oom_victim(child))
> > +				continue;
> >  			/*
> >  			 * oom_badness() returns 0 if the thread is unkillable
> >  			 */
> > -- 
> > 1.8.3.1
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
  2016-02-17 13:32   ` Michal Hocko
@ 2016-02-18 10:45     ` Tetsuo Handa
  2016-02-18 12:20       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-18 10:45 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 19:36:36, Tetsuo Handa wrote:
> > From 0b36864d4100ecbdcaa2fc2d1927c9e270f1b629 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Wed, 17 Feb 2016 16:37:59 +0900
> > Subject: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
> >
> > Currently, out_of_memory() does not wait for existing TIF_MEMDIE threads
> > if /proc/sys/vm/oom_kill_allocating_task is set to 1. This can result in
> > killing more OOM victims than needed. We can wait for the OOM reaper to
> > reap memory used by existing TIF_MEMDIE threads if possible. If the OOM
> > reaper is not available, the system will be kept OOM stalled until an
> > OOM-unkillable thread does a GFP_FS allocation request and calls
> > oom_kill_allocating_task == 0 path.
> >
> > This patch changes oom_kill_allocating_task == 1 case to call
> > select_bad_process() in order to wait for existing TIF_MEMDIE threads.
>
> The primary motivation for oom_kill_allocating_task was to reduce the
> overhead of select_bad_process. See fe071d7e8aae ("oom: add
> oom_kill_allocating_task sysctl"). So this basically defeats the whole
> purpose of the feature.
>

I didn't know that. But I think that printk()ing all candidates much more
significantly degrades performance than scanning the tasklist. It would be
nice if setting /proc/sys/vm/oom_dump_tasks = N (N > 1) shows only top N
memory-hog processes.

> I am not user of this knob because it behaves absolutely randomly but
> IMHO we should simply do something like the following. It would be more
> compliant to the documentation and prevent from livelock which is
> currently possible (albeit very unlikely) when a single task consimes
> all the memory reserves and we keep looping over out_of_memory without
> any progress.
>
> But as I've said I have no idea whether somebody relies on the current
> behavior so this is more of a thinking loudly than proposing an actual
> patch at this point of time.

Maybe try warning messages for finding somebody using
oom_kill_allocating_task?

> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 078e07ec0906..7de84fb2dd03 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -706,6 +706,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
>  		message, task_pid_nr(p), p->comm, points);
>
> +	if (sysctl_oom_kill_allocating_task)
> +		goto kill;
> +

We have

  "Out of memory (oom_kill_allocating_task)"
  "Out of memory"
  "Memory cgroup out of memory"

but we don't have

  "Memory cgroup out of memory (oom_kill_allocating_task)"

.

I don't know whether we should use this condition for memcg OOM case.

>  	/*
>  	 * If any of p's children has a different mm and is eligible for kill,
>  	 * the one with the highest oom_badness() score is sacrificed for its
> @@ -734,6 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	read_unlock(&tasklist_lock);
>
> +kill:
>  	p = find_lock_task_mm(victim);
>  	if (!p) {
>  		put_task_struct(victim);
> @@ -888,6 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  	if (sysctl_oom_kill_allocating_task && current->mm &&
>  	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
>  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> +		if (test_thread_flag(TIF_MEMDIE))
> +			panic("Out of memory (oom_kill_allocating_task) not able to make a forward progress");
> +

If current thread got TIF_MEMDIE, current thread will not call out_of_memory()
again because current thread will exit the allocation (unless __GFP_NOFAIL)
due to use of ALLOC_NO_WATERMARKS.

This condition becomes true only when "some OOM-unkillable thread called
out_of_memory() and chose current as the OOM victim" && "current was
running between gfp_to_alloc_flags() in __alloc_pages_slowpath() and
!mutex_trylock(&oom_lock) in __alloc_pages_may_oom()" which is almost
impossibly triggerable. If we trigger this condition, I think it was
triggered by error by chance (rather than really unable to make a
forward progress).

>  		get_task_struct(current);
>  		oom_kill_process(oc, current, 0, totalpages, NULL,
>  				 "Out of memory (oom_kill_allocating_task)");

Anyway, we can forget about this [PATCH 6/6] for now.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim.
  2016-02-17 16:17               ` Michal Hocko
@ 2016-02-18 11:21                 ` Tetsuo Handa
  0 siblings, 0 replies; 29+ messages in thread
From: Tetsuo Handa @ 2016-02-18 11:21 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > We want to teach the OOM reaper to
> > operate whenever TIF_MEMDIE is set. But this means that we want
> > mm_is_reapable() check because there might be !SIGKILL && !PF_EXITING
> > threads when we run these optimized paths.
>
> > We will need to use timer if mm_is_reapable() == false after all.
>
> Or we should re-evaluate those heuristics for multithreaded processes.

TIF_MEMDIE heuristics are per a task_struct basis but OOM-kill operation
is per a signal_struct basis or per a mm_struct basis.

Since we set TIF_MEMDIE to only one thread (with a wrong assumption that
remaining threads will get TIF_MEMDIE due to fatal_signal_pending()),
we are bothered by corner cases.

> Does it even make sense to shortcut and block the OOM killer if the
> single thread is exiting?

Do we check for clone(!CLONE_SIGHAND && CLONE_VM) threads (i.e. walk the
process list) for checking whether it is really a single thread?
That would be mm_is_reapable().

>                           Only very small amount of memory gets released
> during its exit anyway.

Currently exit_mm() is called before exit_files() etc. are called.
Can we expect a single page of memory being released when such thread
gets stuck at down_read(&mm->mmap_sem) ?

>                         Don't we want to catch only the group exit to
> catch fatal_signal_pending -> exit_signals -> exit_mm -> allocation
> cases? I am not really sure what to check for, to be honest though.
>

I don't know what this line is saying.

> > Why don't you accept timer based workaround now, even if you have a plan
> > to update the OOM reaper for handling these optimized paths?
>
> Because I believe that the timeout based solutions are distracting from
> a proper solution which would be based on actual algorithm/heurstic that
> can be measured and evaluated. And because I can see future discussion
> of whether $FOO or $BAR is a better timeout... I really do not see any
> reason to rush into quick solutions now.

OOM-livelock bugs are caused by over-throttling based on optimistic
assumptions. This [PATCH 5/6] patch is for unthrottling in order to
guarantee forward progress (and eventually trigger kernel panic if
there is no more OOM-killable processes).

I can't see future discussion of whether $FOO or $BAR is a better timeout
because timeout based unthrottling should seldom occur. Even without the
OOM reaper, more than e.g. 99% of innocent OOM events would successfully
solve the OOM condition before this timeout expires. After we merge the
OOM reaper, more than e.g. 99% of malicious OOM events would successfully
solve the OOM condition before this timeout expires. Who can gather data
for discussing whether $FOO or $BAR is a better timeout? Only those who want
to explore this e.g. 1% possibility and those who hate any timeout would
want to disable this timeout.

If we make sure that timeout based unthrottling guarantees forward
progress, we can try to utilize memory reserves more aggressively.
For example, we can set TIF_MEMDIE on all fatal_signal_pending() threads
using a mm_struct chosen by the OOM killer. This will eliminate a wrong
assumption that remaining threads will get TIF_MEMDIE due to
fatal_signal_pending(). We had been too cowardly about use of memory
reserves because currently we have no means to refill the memory reserves.
If timeout based unthrottling kills next OOM victim (and the OOM reaper
reaps it), we can overcommit memory reserves (like we overcommit normal
memory).

I don't think we can manage without timeout based solutions.
I really do not see any reason not to accept [PATCH 5/6] now.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
  2016-02-18 10:45     ` Tetsuo Handa
@ 2016-02-18 12:20       ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2016-02-18 12:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 18-02-16 19:45:45, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-02-16 19:36:36, Tetsuo Handa wrote:
> > > From 0b36864d4100ecbdcaa2fc2d1927c9e270f1b629 Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Wed, 17 Feb 2016 16:37:59 +0900
> > > Subject: [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1
> > >
> > > Currently, out_of_memory() does not wait for existing TIF_MEMDIE threads
> > > if /proc/sys/vm/oom_kill_allocating_task is set to 1. This can result in
> > > killing more OOM victims than needed. We can wait for the OOM reaper to
> > > reap memory used by existing TIF_MEMDIE threads if possible. If the OOM
> > > reaper is not available, the system will be kept OOM stalled until an
> > > OOM-unkillable thread does a GFP_FS allocation request and calls
> > > oom_kill_allocating_task == 0 path.
> > >
> > > This patch changes oom_kill_allocating_task == 1 case to call
> > > select_bad_process() in order to wait for existing TIF_MEMDIE threads.
> >
> > The primary motivation for oom_kill_allocating_task was to reduce the
> > overhead of select_bad_process. See fe071d7e8aae ("oom: add
> > oom_kill_allocating_task sysctl"). So this basically defeats the whole
> > purpose of the feature.
> >
> 
> I didn't know that. But I think that printk()ing all candidates much more
> significantly degrades performance than scanning the tasklist.

I assume those who care do set oom_dump_tasks = 0.

> It would be
> nice if setting /proc/sys/vm/oom_dump_tasks = N (N > 1) shows only top N
> memory-hog processes.

You would need scanning of all tasks anyway and sorting etc... Not worth
bothering IMO.
 
[...]
> We have
> 
>   "Out of memory (oom_kill_allocating_task)"
>   "Out of memory"
>   "Memory cgroup out of memory"
> 
> but we don't have
> 
>   "Memory cgroup out of memory (oom_kill_allocating_task)"
> 
> I don't know whether we should use this condition for memcg OOM case.

memcg oom killer ignores follow oom_kill_allocating_task.
 
> >  	/*
> >  	 * If any of p's children has a different mm and is eligible for kill,
> >  	 * the one with the highest oom_badness() score is sacrificed for its
> > @@ -734,6 +737,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	}
> >  	read_unlock(&tasklist_lock);
> >
> > +kill:
> >  	p = find_lock_task_mm(victim);
> >  	if (!p) {
> >  		put_task_struct(victim);
> > @@ -888,6 +892,9 @@ bool out_of_memory(struct oom_control *oc)
> >  	if (sysctl_oom_kill_allocating_task && current->mm &&
> >  	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
> >  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> > +		if (test_thread_flag(TIF_MEMDIE))
> > +			panic("Out of memory (oom_kill_allocating_task) not able to make a forward progress");
> > +
> 
> If current thread got TIF_MEMDIE, current thread will not call out_of_memory()
> again because current thread will exit the allocation (unless __GFP_NOFAIL)
> due to use of ALLOC_NO_WATERMARKS.

exactly __GFP_NOFAIL has to be handled properly.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.
  2016-02-17 13:20   ` Michal Hocko
@ 2016-04-09 14:00     ` Tetsuo Handa
  2016-04-09 14:04       ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2016-04-09 14:00 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

Michal Hocko wrote:
> On Wed 17-02-16 19:34:46, Tetsuo Handa wrote:
> > >From 6f07b71c97766ec111d26c3424bded465ca48195 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Wed, 17 Feb 2016 16:37:01 +0900
> > Subject: [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.
> > 
> > We are trying to reduce the possibility of hitting OOM livelock by
> > introducing the OOM reaper, but there are situations where the OOM reaper
> > cannot reap the victim's memory. We want to introduce the OOM reaper as
> > simple as possible and make the OOM reaper better via incremental
> > development.
> > 
> > This patch adds a timer for handling corner cases where a TIF_MEMDIE
> > thread got stuck by reasons not handled by the initial version of the
> > OOM reaper. Since "mm,oom: exclude TIF_MEMDIE processes from candidates."
> > made sure that we won't choose the same OOM victim forever and this patch
> > makes sure that the kernel automatically presses SysRq-f upon OOM stalls,
> > we will not OOM stall forever as long as the OOM killer is called.
> 
> Can we actually start by incremental changes first and only get to this
> after we cannot find a proper way to fix existing issues?
> 
> I would like to at least make mmap_sem taken for write killable
> first. This should allow the oom_reaper to make a forward progress and
> allow the OOM killer to select another task when necessary (e.g. the
> victim wasn't sitting on a large amount of reclaimable memory). This
> has an advantage that the TIF_MEMDIE release is bound to a clearly
> defined action rather than a $RANDOM timemout which will always be hard
> to justify. We can talk about timeout based solutions after we are able
> to livelock the system even after all well defined actions will have
> failed. I really consider it premature right now.
> 

We can never fix the mmap_sem taken for write issue because

  (1) You hesitate to guarantee sending SIGKILL to all thread groups
      sharing the victim's memory by eliminating the shortcuts which
      do not send SIGKILL.

  (2) Even if you agree on guarantee sending SIGKILL to all thread
      groups sharing the victim's memory by eliminating the shortcuts,
      there might be OOM_SCORE_ADJ_MIN thread groups which means that
      we are not allowed to send SIGKILL after all.

Unlocking TIF_MEMDIE by the OOM reaper upon successful reaping is
the fastpath and unlocking TIF_MEMDIE by the timer is the slowpath.

Also, I think that the OOM reaper would be a too large change to backport
for 3.10-based and 2.6.32-based distributions.

There is no reason to add this patch which handles the slowpath right now.

> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  mm/oom_kill.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ebc6764..fba2c62 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -45,6 +45,11 @@ int sysctl_oom_dump_tasks = 1;
> >  
> >  DEFINE_MUTEX(oom_lock);
> >  
> > +static void oomkiller_reset(unsigned long arg)
> > +{
> > +}
> > +static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
> > +
> >  #ifdef CONFIG_NUMA
> >  /**
> >   * has_intersects_mems_allowed() - check task eligiblity for kill
> > @@ -299,7 +304,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  	 */
> >  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> >  		if (!is_sysrq_oom(oc))
> > -			return OOM_SCAN_ABORT;
> > +			return timer_pending(&oomkiller_victim_wait_timer) ?
> > +				OOM_SCAN_ABORT : OOM_SCAN_CONTINUE;
> >  	}
> >  	if (!task->mm)
> >  		return OOM_SCAN_CONTINUE;
> > @@ -452,6 +458,8 @@ void mark_oom_victim(struct task_struct *tsk)
> >  	 */
> >  	__thaw_task(tsk);
> >  	atomic_inc(&oom_victims);
> > +	/* Make sure that we won't wait for this task forever. */
> > +	mod_timer(&oomkiller_victim_wait_timer, jiffies + 5 * HZ);
> >  }
> >  
> >  /**
> > -- 
> > 1.8.3.1
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] mm,oom: Re-enable OOM killer using timers.
  2016-04-09 14:00     ` Tetsuo Handa
@ 2016-04-09 14:04       ` Tetsuo Handa
  0 siblings, 0 replies; 29+ messages in thread
From: Tetsuo Handa @ 2016-04-09 14:04 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, linux-kernel, oleg

Tetsuo Handa wrote:
> There is no reason to add this patch which handles the slowpath right now.
Oops.
There is no reason _not_ to add this patch which handles the slowpath right now.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-04-09 14:04 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-17 10:28 [PATCH 0/6] preparation for merging the OOM reaper Tetsuo Handa
2016-02-17 10:29 ` [PATCH 1/6] mm,oom: exclude TIF_MEMDIE processes from candidates Tetsuo Handa
2016-02-17 12:41   ` Michal Hocko
2016-02-17 16:40     ` Tetsuo Handa
2016-02-17 17:33       ` Michal Hocko
2016-02-17 20:55         ` Tetsuo Handa
2016-02-17 10:30 ` [PATCH 2/6] mm,oom: don't abort on exiting processes when selecting a victim Tetsuo Handa
2016-02-17 12:54   ` Michal Hocko
2016-02-17 13:07     ` Tetsuo Handa
2016-02-17 14:00       ` Michal Hocko
2016-02-17 14:39         ` Tetsuo Handa
2016-02-17 15:01           ` Michal Hocko
2016-02-17 15:29             ` Tetsuo Handa
2016-02-17 16:17               ` Michal Hocko
2016-02-18 11:21                 ` Tetsuo Handa
2016-02-17 10:32 ` [PATCH 3/6] mm,oom: exclude oom_task_origin processes if they are OOM victims Tetsuo Handa
2016-02-17 13:02   ` Michal Hocko
2016-02-17 10:33 ` [PATCH 4/6] mm,oom: exclude oom_task_origin processes if they are OOM-unkillable Tetsuo Handa
2016-02-17 13:10   ` Michal Hocko
2016-02-17 13:36     ` Tetsuo Handa
2016-02-17 13:44       ` Michal Hocko
2016-02-17 10:34 ` [PATCH 5/6] mm,oom: Re-enable OOM killer using timers Tetsuo Handa
2016-02-17 13:20   ` Michal Hocko
2016-04-09 14:00     ` Tetsuo Handa
2016-04-09 14:04       ` Tetsuo Handa
2016-02-17 10:36 ` [PATCH 6/6] mm,oom: wait for OOM victims when using oom_kill_allocating_task == 1 Tetsuo Handa
2016-02-17 13:32   ` Michal Hocko
2016-02-18 10:45     ` Tetsuo Handa
2016-02-18 12:20       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).