From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751698AbcFFN04 (ORCPT <rfc822;w@1wt.eu>);
	Mon, 6 Jun 2016 09:26:56 -0400
Received: from mail-wm0-f66.google.com ([74.125.82.66]:33454 "EHLO
	mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750764AbcFFN0z (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 6 Jun 2016 09:26:55 -0400
Date: Mon, 6 Jun 2016 15:26:52 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: linux-mm@kvack.org, rientjes@google.com, oleg@redhat.com,
        vdavydov@parallels.com, akpm@linux-foundation.org,
        linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 10/10] mm, oom: hide mm which is shared with kthread
 or global init
Message-ID: <20160606132650.GI11895@dhcp22.suse.cz>
References: <1464945404-30157-1-git-send-email-mhocko@kernel.org>
 <1464945404-30157-11-git-send-email-mhocko@kernel.org>
 <201606040016.BFG17115.OFMLSJFOtHQOFV@I-love.SAKURA.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201606040016.BFG17115.OFMLSJFOtHQOFV@I-love.SAKURA.ne.jp>
User-Agent: Mutt/1.6.0 (2016-04-01)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat 04-06-16 00:16:32, Tetsuo Handa wrote:
[...]
> Leaving current thread from out_of_memory() without clearing TIF_MEMDIE might
> cause OOM lockup, for there is no guarantee that current thread will not wait
> for locks in unkillable state after current memory allocation request completes
> (e.g. getname() followed by mutex_lock() shown at
> http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ).

OK, so what do you think about the following? I am not entirely happy to
duplicate MMF_OOM_REAPED flags into other code paths but I guess we can
clean this up later.
---
>>From ffa5799390f2924882a9e077b0c59d0660a5c87a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 30 May 2016 18:01:51 +0200
Subject: [PATCH] mm, oom: hide mm which is shared with kthread or global init

The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init. After "mm, oom: skip vforked tasks from being selected"
the victim cannot be a vforked task of the global init so we are left
with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm users are quite
rare as well. In order to guarantee forward progress for the OOM
killer make sure that these really rare cases will not get in the way
and hide the mm from the oom killer by setting the MMF_OOM_REAPED flag
for it.

We cannot keep the TIF_MEMDIE for the victim so let's simply wait for a
while and then drop the flag for all victims except for the current task
which is guaranteed to be in the allocation path already and should be
able to use the memory reserve right away.

If the victim cannot terminate by then simply risk another oom victim
selection. Note that oom_scan_process_thread has to learn about this as
well and ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set
because the (previously) current task might get stuck on the way to exit.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)
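
For readers who want the two decisions in this patch without the
surrounding kernel context, here is a rough userspace model. It is only
a sketch: struct model_task, struct model_mm, the enums and both helper
functions are hypothetical stand-ins and not kernel API. It also assumes
a single task where find_lock_task_mm() would look through the whole
thread group, and it returns SCAN_CONTINUE where the real
oom_scan_process_thread() would fall through to its later checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's own types. */
enum scan_result { SCAN_CONTINUE, SCAN_ABORT };
enum kill_followup { WAKE_REAPER, WAIT_THEN_DROP_MEMDIE, KEEP_MEMDIE };

struct model_mm {
	bool oom_reaped;	/* models MMF_OOM_REAPED in mm->flags */
};

struct model_task {
	int oom_victims;	/* models task->signal->oom_victims */
	struct model_mm *mm;	/* NULL once the task has dropped its mm */
};

/*
 * Models the changed check in oom_scan_process_thread(): an existing
 * victim blocks further victim selection only while it still owns an mm
 * that has not been hidden from the OOM killer.
 */
static enum scan_result model_scan_victim(const struct model_task *t,
					  bool sysrq_oom)
{
	/* Check does not apply; the real function continues elsewhere. */
	if (sysrq_oom || t->oom_victims == 0)
		return SCAN_CONTINUE;

	/* Victim still holds a visible mm: wait for it. */
	if (t->mm && !t->mm->oom_reaped)
		return SCAN_ABORT;

	/* mm already gone or hidden: allow another victim to be picked. */
	return SCAN_CONTINUE;
}

/*
 * Models the tail of oom_kill_process(): wake the reaper when the mm
 * can be reaped, otherwise wait and drop TIF_MEMDIE unless the victim
 * is the current task.
 */
static enum kill_followup model_kill_followup(bool can_oom_reap,
					      bool victim_is_current)
{
	if (can_oom_reap)
		return WAKE_REAPER;
	if (!victim_is_current)
		return WAIT_THEN_DROP_MEMDIE;
	return KEEP_MEMDIE;
}
```

WAIT_THEN_DROP_MEMDIE corresponds to the schedule_timeout_killable(HZ)
followed by exit_oom_victim(victim) in the last hunk. KEEP_MEMDIE is
the current-task case, where the victim is expected to use the memory
reserves right away.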

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a5edec2c2984..ec99bbf43c13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -283,10 +283,24 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * the task has MMF_OOM_REAPED or it cleared its mm already.
+	 * If the access to memory reserves didn't help we should rather try to
+	 * kill somebody else or panic on no oom victim than loop with no way
+	 * forward.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
-		return OOM_SCAN_ABORT;
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+		struct task_struct *p = find_lock_task_mm(task);
+		enum oom_scan_t ret = OOM_SCAN_CONTINUE;
+
+		if (p) {
+			if (!test_bit(MMF_OOM_REAPED, &p->mm->flags))
+				ret = OOM_SCAN_ABORT;
+			task_unlock(p);
+		}
+
+		return ret;
+	}
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -908,9 +922,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			/*
 			 * We cannot use oom_reaper for the mm shared by this
 			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
+			 * memory might be still used. Hide the mm from the oom
+			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
+			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+					task_pid_nr(victim), victim->comm,
+					task_pid_nr(p), p->comm);
 			continue;
 		}
 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
@@ -922,8 +941,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
+	if (can_oom_reap) {
 		wake_oom_reaper(victim);
+	} else if (victim != current) {
+		/*
+		 * If we want to guarantee a forward progress we cannot keep
+		 * the oom victim TIF_MEMDIE here. Sleep for a while and then
+		 * drop the flag to make sure another victim can be selected.
+		 */
+		schedule_timeout_killable(HZ);
+		exit_oom_victim(victim);
+	}
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs