All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
@ 2016-06-23 15:58 Tetsuo Handa
  2016-06-23 16:24 ` [PATCH v2] " Tetsuo Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2016-06-23 15:58 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, Tetsuo Handa, Michal Hocko, Oleg Nesterov,
	Vladimir Davydov, David Rientjes

Patch "mm, oom: fortify task_will_free_mem" removed p->mm != NULL test for
shortcut path in oom_kill_process(). But since commit f44666b04605d1c7
("mm,oom: speed up select_bad_process() loop") changed to iterate using
thread group leaders, the possibility of p->mm == NULL has increased
compared to when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE
is set under task_lock") was proposed. On CONFIG_MMU=n kernels, nothing
will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
by error set to a mm-less thread group leader.

Let's redo find_task_lock_mm() test after task_will_free_mem() returned
true.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f74..846d5a7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -839,9 +839,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		put_task_struct(p);
+		p = find_lock_task_mm(p);
+		if (p) {
+			mark_oom_victim(p);
+			task_unlock(p);
+			wake_oom_reaper(p);
+		}
+		put_task_struct(victim);
 		return;
 	}
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-23 15:58 [PATCH] mm, oom: don't set TIF_MEMDIE on a mm-less thread Tetsuo Handa
@ 2016-06-23 16:24 ` Tetsuo Handa
  2016-06-23 22:58   ` Oleg Nesterov
  2016-06-24  9:54   ` Michal Hocko
  0 siblings, 2 replies; 10+ messages in thread
From: Tetsuo Handa @ 2016-06-23 16:24 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, mhocko, oleg, vdavydov, rientjes

I missed that victim != p case needs to use get_task_struct(). Patch updated.
----------------------------------------
>From 1819ec63b27df2d544f66482439e754d084cebed Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 24 Jun 2016 01:16:02 +0900
Subject: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.

Patch "mm, oom: fortify task_will_free_mem" removed p->mm != NULL test for
shortcut path in oom_kill_process(). But since commit f44666b04605d1c7
("mm,oom: speed up select_bad_process() loop") changed to iterate using
thread group leaders, the possibility of p->mm == NULL has increased
compared to when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE
is set under task_lock") was proposed. On CONFIG_MMU=n kernels, nothing
will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
by error set to a mm-less thread group leader.

Let's do steps for regular path except printing OOM killer messages and
sending SIGKILL.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f74..0a19a24 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -839,9 +839,19 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		put_task_struct(p);
+		p = find_lock_task_mm(victim);
+		if (!p) {
+			put_task_struct(victim);
+			return;
+		} else if (victim != p) {
+			get_task_struct(p);
+			put_task_struct(victim);
+			victim = p;
+		}
+		mark_oom_victim(victim);
+		task_unlock(victim);
+		wake_oom_reaper(victim);
+		put_task_struct(victim);
 		return;
 	}
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-23 16:24 ` [PATCH v2] " Tetsuo Handa
@ 2016-06-23 22:58   ` Oleg Nesterov
  2016-06-24  9:54   ` Michal Hocko
  1 sibling, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2016-06-23 22:58 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, mhocko, vdavydov, rientjes

On 06/24, Tetsuo Handa wrote:
>
> On CONFIG_MMU=n kernels, nothing
> will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
> by error set to a mm-less thread group leader.

and btw this needs more cleanups imo. I mean, the fact we pass task_struct
to wake_oom_reaper() looks, well, strange. But this is off-topic right now.

> @@ -839,9 +839,19 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
>  	if (task_will_free_mem(p)) {
> -		mark_oom_victim(p);
> -		wake_oom_reaper(p);
> -		put_task_struct(p);
> +		p = find_lock_task_mm(victim);
> +		if (!p) {

Well, this doesn't really matter... but imo

		victim = find_lock_task_mm(p);
		if (!victim) {

will look much more readable, and this way we won't depend on the
early "victim = p" initialization at the start.

> +			put_task_struct(victim);
> +			return;
> +		} else if (victim != p) {
> +			get_task_struct(p);
> +			put_task_struct(victim);
> +			victim = p;
> +		}

Tetsuo but this is horrible ;)

At least this needs a comment to explain _why_. Because this looks
"obviously unnecessary"; exit_oom_victim() does find_lock_task_mm()
too and "p" can exit right after we drop task_lock(). Not to mention
that task_will_free_mem() called find_lock_task_mm() right before
that, so this is sub-optimal in any case.

IOW, this should explain that we only need this for mark_oom_victim(),
and only if CONFIG_MMU=n. And this leads to other questions:

	- Why we can livelock in this case? This should be documented
	  too imo,

	  I have to admit I don't understand why. But yes, yes, sorry,
	  I ignored a lot of emails in this area :/

	- Why mark_oom_victim() can't check ->mm != NULL itself?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-23 16:24 ` [PATCH v2] " Tetsuo Handa
  2016-06-23 22:58   ` Oleg Nesterov
@ 2016-06-24  9:54   ` Michal Hocko
  2016-06-24 10:56     ` Tetsuo Handa
  1 sibling, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, oleg, vdavydov, rientjes

On Fri 24-06-16 01:24:46, Tetsuo Handa wrote:
> I missed that victim != p case needs to use get_task_struct(). Patch updated.
> ----------------------------------------
> >From 1819ec63b27df2d544f66482439e754d084cebed Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 24 Jun 2016 01:16:02 +0900
> Subject: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
> 
> Patch "mm, oom: fortify task_will_free_mem" removed p->mm != NULL test for
> shortcut path in oom_kill_process(). But since commit f44666b04605d1c7
> ("mm,oom: speed up select_bad_process() loop") changed to iterate using
> thread group leaders, the possibility of p->mm == NULL has increased
> compared to when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE
> is set under task_lock") was proposed. On CONFIG_MMU=n kernels, nothing
> will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
> by error set to a mm-less thread group leader.
> 
> Let's do steps for regular path except printing OOM killer messages and
> sending SIGKILL.

I fully agree with Oleg. It would be much better to encapsulate this
into mark_oom_victim and guard it by ifdef NOMMU as this is nommu
specific with a big fat warning why we need it.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-24  9:54   ` Michal Hocko
@ 2016-06-24 10:56     ` Tetsuo Handa
  2016-06-24 12:04       ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2016-06-24 10:56 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, oleg, vdavydov, rientjes

Michal Hocko wrote:
> On Fri 24-06-16 01:24:46, Tetsuo Handa wrote:
> > I missed that victim != p case needs to use get_task_struct(). Patch updated.
> > ----------------------------------------
> > >From 1819ec63b27df2d544f66482439e754d084cebed Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Fri, 24 Jun 2016 01:16:02 +0900
> > Subject: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
> > 
> > Patch "mm, oom: fortify task_will_free_mem" removed p->mm != NULL test for
> > shortcut path in oom_kill_process(). But since commit f44666b04605d1c7
> > ("mm,oom: speed up select_bad_process() loop") changed to iterate using
> > thread group leaders, the possibility of p->mm == NULL has increased
> > compared to when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE
> > is set under task_lock") was proposed. On CONFIG_MMU=n kernels, nothing
> > will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
> > by error set to a mm-less thread group leader.
> > 
> > Let's do steps for regular path except printing OOM killer messages and
> > sending SIGKILL.
> 
> I fully agree with Oleg. It would be much better to encapsulate this
> into mark_oom_victim and guard it by ifdef NOMMU as this is nommu
> specific with a big fat warning why we need it.

OK. But before doing so, which one ((A) or (B) shown below) do you prefer?


(A) Don't use task_will_free_mem(p) shortcut in oom_kill_process() if CONFIG_MMU=n.

    Since task_will_free_mem(p) == true where p is the largest memory consumer
    (with oom_score_adj taken into account) is not exiting smoothly, as with
    commit 6a618957ad17d8f4 ("mm: oom_kill: don't ignore oom score on exiting
    tasks") thought, it can be a sign of something bad (possibly OOM livelock) is
    happening. Thus, print the OOM killer messages anyway although all tasks
    which will be OOM killed are already killed/exiting (unless p has OOM killable
    children). This will help giving administrator a hint when the kernel hit
    OOM livelock.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f74..e7d38f62 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -834,6 +834,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
+#ifdef CONFIG_MMU
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
@@ -844,6 +845,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		put_task_struct(p);
 		return;
 	}
+#endif
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
----------

(B) Check mm in mark_oom_victim() if CONFIG_MMU=n.

    Since mark_oom_victim() is also called from current->mm && task_will_free_mem(current)
    shortcut in out_of_memory(), mark_oom_victim(current) needs to set TIF_MEMDIE on current
    if current->mm != NULL.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f74..bf45666 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -668,9 +668,29 @@ subsys_initcall(oom_init)
 void mark_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
+#ifdef CONFIG_MMU
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+#else
+	/*
+	 * Make sure that we set TIF_MEMDIE on a thread with mm in order to
+	 * reduce possibility of hitting OOM livelock.
+	 */
+	task_lock(tsk);
+	if (!tsk->mm) {
+		task_unlock(tsk);
+		tsk = find_lock_task_mm(tsk);
+		if (!tsk)
+			return;
+	}
+	/* OOM killer might race with memcg OOM */
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) {
+		task_unlock(tsk);
+		return;
+	}
+	task_unlock(tsk);
+#endif
 	atomic_inc(&tsk->signal->oom_victims);
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-24 10:56     ` Tetsuo Handa
@ 2016-06-24 12:04       ` Michal Hocko
  2016-06-24 16:19         ` Tetsuo Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-06-24 12:04 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, oleg, vdavydov, rientjes

On Fri 24-06-16 19:56:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 24-06-16 01:24:46, Tetsuo Handa wrote:
> > > I missed that victim != p case needs to use get_task_struct(). Patch updated.
> > > ----------------------------------------
> > > >From 1819ec63b27df2d544f66482439e754d084cebed Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Fri, 24 Jun 2016 01:16:02 +0900
> > > Subject: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
> > > 
> > > Patch "mm, oom: fortify task_will_free_mem" removed p->mm != NULL test for
> > > shortcut path in oom_kill_process(). But since commit f44666b04605d1c7
> > > ("mm,oom: speed up select_bad_process() loop") changed to iterate using
> > > thread group leaders, the possibility of p->mm == NULL has increased
> > > compared to when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE
> > > is set under task_lock") was proposed. On CONFIG_MMU=n kernels, nothing
> > > will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
> > > by error set to a mm-less thread group leader.
> > > 
> > > Let's do steps for regular path except printing OOM killer messages and
> > > sending SIGKILL.
> > 
> > I fully agree with Oleg. It would be much better to encapsulate this
> > into mark_oom_victim and guard it by ifdef NOMMU as this is nommu
> > specific with a big fat warning why we need it.
> 
> OK. But before doing so, which one ((A) or (B) shown below) do you prefer?
> 
> 
> (A) Don't use task_will_free_mem(p) shortcut in oom_kill_process() if CONFIG_MMU=n.
> 
>     Since task_will_free_mem(p) == true where p is the largest memory consumer
>     (with oom_score_adj taken into account) is not exiting smoothly, as with
>     commit 6a618957ad17d8f4 ("mm: oom_kill: don't ignore oom score on exiting
>     tasks") thought, it can be a sign of something bad (possibly OOM livelock) is
>     happening. Thus, print the OOM killer messages anyway although all tasks
>     which will be OOM killed are already killed/exiting (unless p has OOM killable
>     children). This will help giving administrator a hint when the kernel hit
>     OOM livelock.
[...]
> (B) Check mm in mark_oom_victim() if CONFIG_MMU=n.
> 
>     Since mark_oom_victim() is also called from current->mm && task_will_free_mem(current)
>     shortcut in out_of_memory(), mark_oom_victim(current) needs to set TIF_MEMDIE on current
>     if current->mm != NULL.

I think you are overcomplicating this. Why cannot we simply do the
following?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..97be9324a58b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -671,6 +671,22 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+#ifndef CONFIG_MMU
+	/*
+	 * we shouldn't risk setting TIF_MEMDIE on a task which has passed its
+	 * exit_mm task->mm = NULL and exit_oom_victim otherwise it could
+	 * theoretically keep its TIF_MEMDIE for ever while waiting for a parent
+	 * to get it out of zombie state. MMU doesn't have this problem because
+	 * it has the oom_reaper to clear the flag asynchronously.
+	 */
+	task_lock(tsk);
+	if (!tsk->mm) {
+		clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+		task_unlock(tsk);
+		return;
+	}
+	taks_unlock(tsk);
+#endif
 	atomic_inc(&tsk->signal->oom_victims);
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-24 12:04       ` Michal Hocko
@ 2016-06-24 16:19         ` Tetsuo Handa
  2016-06-27 11:37           ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2016-06-24 16:19 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, oleg, vdavydov, rientjes

Michal Hocko wrote:
> On Fri 24-06-16 19:56:43, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 24-06-16 01:24:46, Tetsuo Handa wrote:
> > > > I missed that victim != p case needs to use get_task_struct(). Patch updated.
> > > > ----------------------------------------
> > > > >From 1819ec63b27df2d544f66482439e754d084cebed Mon Sep 17 00:00:00 2001
> > > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Date: Fri, 24 Jun 2016 01:16:02 +0900
> > > > Subject: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
> > > > 
> > > > Patch "mm, oom: fortify task_will_free_mem" removed p->mm != NULL test for
> > > > shortcut path in oom_kill_process(). But since commit f44666b04605d1c7
> > > > ("mm,oom: speed up select_bad_process() loop") changed to iterate using
> > > > thread group leaders, the possibility of p->mm == NULL has increased
> > > > compared to when commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE
> > > > is set under task_lock") was proposed. On CONFIG_MMU=n kernels, nothing
> > > > will clear TIF_MEMDIE and the system can OOM livelock if TIF_MEMDIE was
> > > > by error set to a mm-less thread group leader.
> > > > 
> > > > Let's do steps for regular path except printing OOM killer messages and
> > > > sending SIGKILL.
> > > 
> > > I fully agree with Oleg. It would be much better to encapsulate this
> > > into mark_oom_victim and guard it by ifdef NOMMU as this is nommu
> > > specific with a big fat warning why we need it.
> > 
> > OK. But before doing so, which one ((A) or (B) shown below) do you prefer?
> > 
> > 
> > (A) Don't use task_will_free_mem(p) shortcut in oom_kill_process() if CONFIG_MMU=n.
> > 
> >     Since task_will_free_mem(p) == true where p is the largest memory consumer
> >     (with oom_score_adj taken into account) is not exiting smoothly, as with
> >     commit 6a618957ad17d8f4 ("mm: oom_kill: don't ignore oom score on exiting
> >     tasks") thought, it can be a sign of something bad (possibly OOM livelock) is
> >     happening. Thus, print the OOM killer messages anyway although all tasks
> >     which will be OOM killed are already killed/exiting (unless p has OOM killable
> >     children). This will help giving administrator a hint when the kernel hit
> >     OOM livelock.
> [...]
> > (B) Check mm in mark_oom_victim() if CONFIG_MMU=n.
> > 
> >     Since mark_oom_victim() is also called from current->mm && task_will_free_mem(current)
> >     shortcut in out_of_memory(), mark_oom_victim(current) needs to set TIF_MEMDIE on current
> >     if current->mm != NULL.
> 
> I think you are overcomplicating this. Why cannot we simply do the
> following?
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4c21f744daa6..97be9324a58b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -671,6 +671,22 @@ void mark_oom_victim(struct task_struct *tsk)
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> +#ifndef CONFIG_MMU
> +	/*
> +	 * we shouldn't risk setting TIF_MEMDIE on a task which has passed its
> +	 * exit_mm task->mm = NULL and exit_oom_victim otherwise it could
> +	 * theoretically keep its TIF_MEMDIE for ever while waiting for a parent
> +	 * to get it out of zombie state. MMU doesn't have this problem because
> +	 * it has the oom_reaper to clear the flag asynchronously.
> +	 */
> +	task_lock(tsk);
> +	if (!tsk->mm) {
> +		clear_tsk_thread_flag(tsk, TIF_MEMDIE);
> +		task_unlock(tsk);
> +		return;
> +	}
> +	taks_unlock(tsk);

This makes mark_oom_victim(tsk) for tsk->mm == NULL a no-op unless tsk is
currently doing memory allocation. And it is possible that tsk is blocked
waiting for somebody else's memory allocation after returning from
exit_mm() from do_exit(), isn't it? Then, how is this better than current
code (i.e. sets TIF_MEMDIE to a mm-less thread group leader)?

> +#endif
>  	atomic_inc(&tsk->signal->oom_victims);
>  	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-24 16:19         ` Tetsuo Handa
@ 2016-06-27 11:37           ` Michal Hocko
  2016-06-27 13:32             ` Tetsuo Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2016-06-27 11:37 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, oleg, vdavydov, rientjes

On Sat 25-06-16 01:19:12, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 4c21f744daa6..97be9324a58b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -671,6 +671,22 @@ void mark_oom_victim(struct task_struct *tsk)
> >  	/* OOM killer might race with memcg OOM */
> >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> >  		return;
> > +#ifndef CONFIG_MMU
> > +	/*
> > +	 * we shouldn't risk setting TIF_MEMDIE on a task which has passed its
> > +	 * exit_mm task->mm = NULL and exit_oom_victim otherwise it could
> > +	 * theoretically keep its TIF_MEMDIE for ever while waiting for a parent
> > +	 * to get it out of zombie state. MMU doesn't have this problem because
> > +	 * it has the oom_reaper to clear the flag asynchronously.
> > +	 */
> > +	task_lock(tsk);
> > +	if (!tsk->mm) {
> > +		clear_tsk_thread_flag(tsk, TIF_MEMDIE);
> > +		task_unlock(tsk);
> > +		return;
> > +	}
> > +	taks_unlock(tsk);
> 
> This makes mark_oom_victim(tsk) for tsk->mm == NULL a no-op unless tsk is
> currently doing memory allocation. And it is possible that tsk is blocked
> waiting for somebody else's memory allocation after returning from
> exit_mm() from do_exit(), isn't it? Then, how is this better than current
> code (i.e. sets TIF_MEMDIE to a mm-less thread group leader)?

Well, the whole point of the check is to not set the flag after we
could have passed exit_mm->exit_oom_victim and keep it for the rest of
(unbounded) victim life as there is nothing else to do so.
If the tsk is waiting for something then we are screwed same way we were
before. Or have I missed your point?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-27 11:37           ` Michal Hocko
@ 2016-06-27 13:32             ` Tetsuo Handa
  2016-06-27 14:06               ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Tetsuo Handa @ 2016-06-27 13:32 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, oleg, vdavydov, rientjes

Michal Hocko wrote:
> On Sat 25-06-16 01:19:12, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 4c21f744daa6..97be9324a58b 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -671,6 +671,22 @@ void mark_oom_victim(struct task_struct *tsk)
> > >  	/* OOM killer might race with memcg OOM */
> > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > >  		return;
> > > +#ifndef CONFIG_MMU
> > > +	/*
> > > +	 * we shouldn't risk setting TIF_MEMDIE on a task which has passed its
> > > +	 * exit_mm task->mm = NULL and exit_oom_victim otherwise it could
> > > +	 * theoretically keep its TIF_MEMDIE for ever while waiting for a parent
> > > +	 * to get it out of zombie state. MMU doesn't have this problem because
> > > +	 * it has the oom_reaper to clear the flag asynchronously.
> > > +	 */
> > > +	task_lock(tsk);
> > > +	if (!tsk->mm) {
> > > +		clear_tsk_thread_flag(tsk, TIF_MEMDIE);
> > > +		task_unlock(tsk);
> > > +		return;
> > > +	}
> > > +	taks_unlock(tsk);
> > 
> > This makes mark_oom_victim(tsk) for tsk->mm == NULL a no-op unless tsk is
> > currently doing memory allocation. And it is possible that tsk is blocked
> > waiting for somebody else's memory allocation after returning from
> > exit_mm() from do_exit(), isn't it? Then, how is this better than current
> > code (i.e. sets TIF_MEMDIE to a mm-less thread group leader)?
> 
> Well, the whole point of the check is to not set the flag after we
> could have passed exit_mm->exit_oom_victim and keep it for the rest of
> (unbounded) victim life as there is nothing else to do so.

OK. Based on commit 3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to
out_of_memory") and an assumption that any OOM-killed thread shall eventually
win the mutex_trylock(&oom_lock) competition in __alloc_pages_may_oom() no
matter how disturbing factors (e.g. scheduling priority) delay OOM-killed
threads, you prefer asking each OOM-killed thread to get TIF_MEMDIE via

  if (current->mm && task_will_free_mem(current))

shortcut in out_of_memory() by keeping

  if (task_will_free_mem(p))

shortcut in oom_kill_process() a no-op. Yes, it should be harmless.

But I prefer not to wait for each OOM-killed thread to win the
mutex_trylock(&oom_lock) competition in __alloc_pages_may_oom().
Setting TIF_MEMDIE at

  if (task_will_free_mem(p))

shortcut in oom_kill_process() can save somebody which got TIF_MEMDIE from
participating in the mutex_trylock(&oom_lock) competition which is needed for
calling

  if (current->mm && task_will_free_mem(current))

shortcut in out_of_memory().

> If the tsk is waiting for something then we are screwed same way we were
> before. Or have I missed your point?
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm, oom: don't set TIF_MEMDIE on a mm-less thread.
  2016-06-27 13:32             ` Tetsuo Handa
@ 2016-06-27 14:06               ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-06-27 14:06 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, oleg, vdavydov, rientjes

On Mon 27-06-16 22:32:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 25-06-16 01:19:12, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 4c21f744daa6..97be9324a58b 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -671,6 +671,22 @@ void mark_oom_victim(struct task_struct *tsk)
> > > >  	/* OOM killer might race with memcg OOM */
> > > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > >  		return;
> > > > +#ifndef CONFIG_MMU
> > > > +	/*
> > > > +	 * we shouldn't risk setting TIF_MEMDIE on a task which has passed its
> > > > +	 * exit_mm task->mm = NULL and exit_oom_victim otherwise it could
> > > > +	 * theoretically keep its TIF_MEMDIE for ever while waiting for a parent
> > > > +	 * to get it out of zombie state. MMU doesn't have this problem because
> > > > +	 * it has the oom_reaper to clear the flag asynchronously.
> > > > +	 */
> > > > +	task_lock(tsk);
> > > > +	if (!tsk->mm) {
> > > > +		clear_tsk_thread_flag(tsk, TIF_MEMDIE);
> > > > +		task_unlock(tsk);
> > > > +		return;
> > > > +	}
> > > > +	taks_unlock(tsk);
> > > 
> > > This makes mark_oom_victim(tsk) for tsk->mm == NULL a no-op unless tsk is
> > > currently doing memory allocation. And it is possible that tsk is blocked
> > > waiting for somebody else's memory allocation after returning from
> > > exit_mm() from do_exit(), isn't it? Then, how is this better than current
> > > code (i.e. sets TIF_MEMDIE to a mm-less thread group leader)?
> > 
> > Well, the whole point of the check is to not set the flag after we
> > could have passed exit_mm->exit_oom_victim and keep it for the rest of
> > (unbounded) victim life as there is nothing else to do so.
> 
> OK. Based on commit 3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to
> out_of_memory") and an assumption that any OOM-killed thread shall eventually
> win the mutex_trylock(&oom_lock) competition in __alloc_pages_may_oom() no
> matter how disturbing factors (e.g. scheduling priority) delay OOM-killed
> threads, you prefer asking each OOM-killed thread to get TIF_MEMDIE via
> 
>   if (current->mm && task_will_free_mem(current))
> 
> shortcut in out_of_memory() by keeping
> 
>   if (task_will_free_mem(p))
> 
> shortcut in oom_kill_process() a no-op. Yes, it should be harmless.

OK, I understand your point finally. Thanks for the clarification! And
you are right, I really do not care all that much about the latency
here. All I am looking for is the most simplistic solution for the
potential, albeit highly unlikely, race for a configuration for which
nobody actually complained/reported a bug.
 
> But I prefer not to wait for each OOM-killed thread to win the
> mutex_trylock(&oom_lock) competition in __alloc_pages_may_oom().
> Setting TIF_MEMDIE at
> 
>   if (task_will_free_mem(p))
> 
> shortcut in oom_kill_process() can save somebody which got TIF_MEMDIE from
> participating in the mutex_trylock(&oom_lock) competition which is needed for
> calling
> 
>   if (current->mm && task_will_free_mem(current))
> 
> shortcut in out_of_memory().

The code is complex enough that keeping it simpler makes a lot of sense
to me. Your dances with the find_lock_task_mm really didn't make it
easier to follow IMHO. The explicit check at a single place seems more
obious and easier to maintain to me.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-06-27 14:06 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-23 15:58 [PATCH] mm, oom: don't set TIF_MEMDIE on a mm-less thread Tetsuo Handa
2016-06-23 16:24 ` [PATCH v2] " Tetsuo Handa
2016-06-23 22:58   ` Oleg Nesterov
2016-06-24  9:54   ` Michal Hocko
2016-06-24 10:56     ` Tetsuo Handa
2016-06-24 12:04       ` Michal Hocko
2016-06-24 16:19         ` Tetsuo Handa
2016-06-27 11:37           ` Michal Hocko
2016-06-27 13:32             ` Tetsuo Handa
2016-06-27 14:06               ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.