From: Michal Hocko <mhocko@kernel.org>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Oleg Nesterov <oleg@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Vladimir Davydov <vdavydov@parallels.com>,
	Michal Hocko <mhocko@suse.com>
Subject: [PATCH 03/10] oom: keep mm of the killed task available
Date: Thu, 28 Jul 2016 21:42:27 +0200
Message-ID: <1469734954-31247-4-git-send-email-mhocko@kernel.org>
In-Reply-To: <1469734954-31247-1-git-send-email-mhocko@kernel.org>

From: Michal Hocko <mhocko@suse.com>

oom_reap_task has to call exit_oom_victim in order to make sure that the
oom victim will not block the oom killer forever. This, however, opens
up new problems (e.g. oom_killer_disable exclusion - see
74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race")). Ideally, exit_oom_victim should only be called from the
victim's context.

One way to achieve this would be to rely on per mm_struct flags. We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:
do_exit
  exit_mm
    tsk->mm = NULL;
    mmput
      __mmput
    exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time. At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach. We record tsk->mm in the
signal_struct and bind it to the signal_struct lifetime for all oom
victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have
to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct. As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.
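
To illustrate the lifetime rule described above, here is a minimal
user-space sketch of the pinning scheme. It is not kernel code: the
mm_model and signal_model types and all helper names are hypothetical
stand-ins, and C11 atomics are assumed in place of the kernel's cmpxchg
and atomic_t. The first caller to mark a victim publishes the mm and
takes one reference; that reference is dropped only when the signal
struct itself is released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for mm_struct; only the reference count is modeled. */
struct mm_model {
	atomic_int mm_count;
};

/* Hypothetical stand-in for signal_struct; only oom_mm is modeled. */
struct signal_model {
	_Atomic(struct mm_model *) oom_mm;
};

/* Allocate an mm holding a single reference owned by the caller. */
static struct mm_model *mm_model_alloc(void)
{
	struct mm_model *mm = malloc(sizeof(*mm));

	if (mm)
		atomic_init(&mm->mm_count, 1);
	return mm;
}

/* Models mmdrop(): drop one reference; returns 1 if this was the last one. */
static int mm_model_drop(struct mm_model *mm)
{
	int old = atomic_fetch_sub(&mm->mm_count, 1);

	assert(old > 0);
	if (old == 1) {
		free(mm);
		return 1;
	}
	return 0;
}

/*
 * Models the oom_mm part of mark_oom_victim(): only the first caller
 * installs the mm and pins it. Returns 1 if this call took the reference.
 */
static int signal_model_pin_mm(struct signal_model *sig, struct mm_model *mm)
{
	struct mm_model *expected = NULL;

	if (!atomic_compare_exchange_strong(&sig->oom_mm, &expected, mm))
		return 0;	/* already recorded by an earlier call */
	atomic_fetch_add(&mm->mm_count, 1);
	return 1;
}

/*
 * Models the free_signal_struct() hunk: drop the pinned reference, if any.
 * Returns 1 if the mm was freed as a result.
 */
static int signal_model_release(struct signal_model *sig)
{
	struct mm_model *mm = atomic_load(&sig->oom_mm);

	if (!mm)
		return 0;
	atomic_store(&sig->oom_mm, NULL);
	return mm_model_drop(mm);
}
```

In this model a second signal_model_pin_mm() call on the same signal
struct is a no-op, and the mm stays valid after its owner drops its own
reference, until signal_model_release() runs.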

Growing the signal_struct for something as unlikely as the oom
killer is far from ideal but this approach will make the code much more
reasonable and in the long term we might even want to move task->mm into
the signal_struct anyway. In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent,
which would also be nice.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  2 ++
 kernel/fork.c         |  2 ++
 mm/oom_kill.c         | 51 +++++++++++++++++++++------------------------------
 3 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d9585771dc8b..8943546d52e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -792,6 +792,8 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
+					 * killed by the oom killer */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 191844a157df..7e9f83d5fe95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -243,6 +243,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	if (sig->oom_mm)
+		mmdrop(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a8c06883a142..7f09608405b7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,14 +288,11 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
+		struct mm_struct *mm = task->signal->oom_mm;
 		enum oom_scan_t ret = OOM_SCAN_ABORT;
 
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
+		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+			ret = OOM_SCAN_CONTINUE;
 
 		return ret;
 	}
@@ -526,11 +523,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * This task can be safely ignored because we cannot do much more
-	 * to release its memory.
-	 */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
-	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and
 	 * put the oom_reaper out of the way.
@@ -545,20 +537,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p = find_lock_task_mm(tsk);
-
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	if (!p)
-		goto done;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
@@ -567,8 +546,6 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
-	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
@@ -584,11 +561,14 @@ static void oom_reap_task(struct task_struct *tsk)
 	tsk->oom_reaper_list = NULL;
 	exit_oom_victim(tsk);
 
+	/*
+	 * Hide this mm from OOM killer because it has been either reaped or
+	 * somebody can't call up_write(mmap_sem).
+	 */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-	/* Drop a reference taken above. */
-	if (mm)
-		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
@@ -650,14 +630,25 @@ subsys_initcall(oom_init)
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
+ *
+ * tsk->mm has to be non NULL and caller has to guarantee it is stable (either
+ * under task_lock or operate on the current).
  */
 void mark_oom_victim(struct task_struct *tsk)
 {
+	struct mm_struct *mm = tsk->mm;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	atomic_inc(&tsk->signal->oom_victims);
+
+	/* oom_mm is bound to the signal struct life time. */
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		atomic_inc(&tsk->signal->oom_mm->mm_count);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
-- 
2.8.1

