Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?

From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: linux-mm@kvack.org, rientjes@google.com, oleg@redhat.com,
	vdavydov@parallels.com, mgorman@techsingularity.net,
	hughd@google.com, riel@redhat.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org
Subject: Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
Date: Tue, 21 Jun 2016 10:31:54 +0200	[thread overview]
Message-ID: <20160621083154.GA30848@dhcp22.suse.cz> (raw)
In-Reply-To: <20160613111943.GB6518@dhcp22.suse.cz>

On Mon 13-06-16 13:19:43, Michal Hocko wrote:
[...]
> I am trying to remember why we are disabling oom killer before kernel
> threads are frozen but not really sure about that right away.

OK, I guess I remember now. Say that a task would depend on a freezable
kernel thread to get to do_exit (stuck in wait_event etc...). We would
simply get stuck in oom_killer_disable for ever. So we need to address
it a different way.

One way would be what you are proposing but I guess it would be more
systematic to never call exit_oom_victim on a remote task.  After [1] we
have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
is set. It is more code than your patch so I can see a reason to go with
yours if the following one seems too large or ugly.

[1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org

What do you think about the following?
---
>From f7ffecf944648f16d98fc70275041433ed68f7e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 20 Jun 2016 17:28:18 +0200
Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.

This patch effectivelly reverts 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper"). We have a more
efficient way to achieve the same. We can rely on MMF_REAPED because
"mm, oom: hide mm which is shared with kthread or global init" made sure
that we skip tasks with MMF_REAPED even when they have TIF_MEMDIE.

The only missing part is that we have to handle __oom_reap_task failure
gracefully. Currently we mark the mm by MMF_OOM_NOT_REAPABLE to later
set MMF_REAPED should we fail __oom_reap_task again. This relies on
clearing TIF_MEMDIE to select that task again. We can achieve a similar
thing by requeuing the task to the oom_reaper_list. Introduce
oom_reaper_add_tail to add the task to the tail of the oom_reaper_list.
While we are at it reuse it for wake_oom_reaper as well because the
queueing should be FIFO.

In the result only the oom victim will be responsible to clear its
TIF_MEMDIE flag so the above race is not possible anymore. oom_reaper
will be allowed to play only with MMF_* flags so the responsibility
model will be more clear.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 86 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..dbc62a69fee8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -558,6 +558,51 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	return ret;
 }
 
+static void oom_reaper_add_tail(struct task_struct *tsk)
+{
+	struct task_struct *q;
+
+	spin_lock(&oom_reaper_lock);
+	if (!oom_reaper_list) {
+		oom_reaper_list = tsk;
+		goto unlock;
+	}
+
+	for (q = oom_reaper_list; q->oom_reaper_list; q = q->oom_reaper_list)
+		;
+	q->oom_reaper_list = tsk;
+
+unlock:
+	spin_unlock(&oom_reaper_lock);
+}
+
+static inline bool requeue_oom_victim(struct task_struct *tsk)
+{
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
+		return false;
+
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags))
+		set_bit(MMF_OOM_REAPED, &p->mm->flags);
+	else
+		ret = true;
+
+	task_unlock(p);
+
+	if (ret)
+		oom_reaper_add_tail(tsk);
+	return ret;
+}
+
 #define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
@@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	tsk->oom_reaper_list = NULL;
 
+	if (attempts > MAX_OOM_REAP_RETRIES) {
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		debug_show_all_locks();
+
+		if (requeue_oom_victim(tsk)) {
+			cond_resched();
+			return;
 		}
 
-		debug_show_all_locks();
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
 	}
 
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
-	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -637,11 +665,7 @@ void wake_oom_reaper(struct task_struct *tsk)
 		return;
 
 	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	oom_reaper_add_tail(tsk);
 	wake_up(&oom_reaper_wait);
 }
 
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs