linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mm, oom_reaper: How to handle race with oom_killer_disable() ?
@ 2016-06-10 14:23 Tetsuo Handa
  2016-06-13 11:19 ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-06-10 14:23 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

I got a question about commit e26796066fdf929c ("oom: make oom_reaper freezable").

static int enter_state(suspend_state_t state) {
  pr_debug("PM: Preparing system for sleep (%s)\n", pm_states[state]);
  error = suspend_prepare(state) {
    error = suspend_freeze_processes() {
      error = freeze_processes();
      if (error)
        return error;
      error = freeze_kernel_threads();
    }
  }
}

int freeze_processes(void) {
  if (!pm_freezing)
    atomic_inc(&system_freezing_cnt); // system_freezing_cnt becomes non-0 due to pm_freezing == false.
  pr_info("Freezing user space processes ... ");
  pm_freezing = true;
  error = try_to_freeze_tasks(true) {
    for_each_process_thread(g, p) {
      if (p == current || !freeze_task(p)) // freeze_task(oom_reaper_th) is called.
        continue;
    }
  }
  if (!error && !oom_killer_disable())
    error = -EBUSY;
}

bool freeze_task(struct task_struct *p) {
  if (!freezing(p) || frozen(p)) { // freezing(oom_reaper_th) is called.
    return false;
  }
  if (!(p->flags & PF_KTHREAD))
    fake_signal_wake_up(p);
  else
    wake_up_state(p, TASK_INTERRUPTIBLE);
  return true;
}

static inline bool freezing(struct task_struct *p) {
  if (likely(!atomic_read(&system_freezing_cnt))) // system_freezing_cnt is non-0 due to freeze_processes().
    return false;
  return freezing_slow_path(p); // freezing_slow_path(oom_reaper_th) is called.
}

bool freezing_slow_path(struct task_struct *p) {
  if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK)) // false due to set_freezable() by oom_reaper().
    return false;
  if (test_thread_flag(TIF_MEMDIE)) // false due to unkillable thread.
    return false;
  if (pm_nosig_freezing || cgroup_freezing(p)) // pm_nosig_freezing == false for now. I assume cgroup_freezing(oom_reaper_th) is also false.
    return true;
  if (pm_freezing && !(p->flags & PF_KTHREAD)) // pm_freezing == true due to freeze_processes(). But oom_reaper_th has PF_KTHREAD.
    return true;
  return false; // After all, freezing(oom_reaper_th) returns false for now.
}

Therefore, freeze_task(oom_reaper_th) from freeze_processes() is a no-op.

int freeze_kernel_threads(void) {
  pr_info("Freezing remaining freezable tasks ... ");
  pm_nosig_freezing = true;
  error = try_to_freeze_tasks(false) {
    for_each_process_thread(g, p) {
      if (p == current || !freeze_task(p)) // freeze_task(oom_reaper_th) is called again.
        continue;
    }
  }
}

bool freeze_task(struct task_struct *p) {
  if (!freezing(p) || frozen(p)) { // freezing(oom_reaper_th) is called again.
    return false;
  }
  if (!(p->flags & PF_KTHREAD))
    fake_signal_wake_up(p);
  else
    wake_up_state(p, TASK_INTERRUPTIBLE);
  return true;
}

This time, freezing(oom_reaper_th) == true due to pm_nosig_freezing == true
and frozen(oom_reaper_th) == false. Thus, wake_up_state() wakes up the
oom_reaper_th sleeping at wait_event_freezable(), and the oom_reaper_th
calls __refrigerator() via try_to_freeze().

Therefore, the OOM reaper kernel thread might clear TIF_MEMDIE after
oom_killer_disable() set oom_killer_disabled = true, doesn't it?
This in turn allows sequence shown below, doesn't it?

  (1) freeze_processes() starts freezing user space threads.
  (2) Somebody (maybe a kenrel thread) calls out_of_memory().
  (3) The OOM killer calls mark_oom_victim() on a user space thread
      P1 which is already in __refrigerator().
  (4) oom_killer_disable() sets oom_killer_disabled = true.
  (5) P1 leaves __refrigerator() and enters do_exit().
  (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
      exit_oom_victim(P1).
  (7) oom_killer_disable() returns while P1 is not yet frozen
      again (i.e. not yet marked as PF_FROZEN).
  (8) P1 perform IO/interfere with the freezer.

try_to_freeze_tasks(false) from freeze_kernel_threads() will freeze
P1 again, but it seems to me that freeze_kernel_threads() is not
always called when freeze_processes() suceeded.

Therefore, we need to do like

-	exit_oom_victim(tsk);
+	mutex_lock(&oom_lock);
+	if (!oom_killer_disabled)
+		exit_oom_victim(tsk);
+	mutex_unlock(&oom_lock);

in oom_reap_task(), don't we?

----------------------------------------
>From 331842e86a6778cfce6de13ae6c029bc5123a714 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 10 Jun 2016 19:46:26 +0900
Subject: [PATCH] mm, oom_reaper: close race with oom_killer_disable()

Commit e26796066fdf929c ("oom: make oom_reaper freezable") meant to close
race with oom_killer_disable() by adding set_freezable() to oom_reaper().
But freezable kernel threads are not frozen until freeze_kernel_threads()
is called, which occurs after oom_killer_disable() from freeze_processes()
is called. Therefore, oom_reaper() needs to check right before calling
exit_oom_victim() whether oom_killer_disable() is already in progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/oom_kill.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..29f42ad 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -559,9 +559,15 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * reasonably reclaimable memory anymore or it is not a good candidate
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
+	 *
+	 * But be sure to check race with oom_killer_disable() because
+	 * oom_reaper() is frozen after oom_killer_disable() returned.
 	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
+	mutex_lock(&oom_lock);
+	if (!oom_killer_disabled)
+		exit_oom_victim(tsk);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
1.8.3.1
----------------------------------------

But we might be able to do like below patch rather than above patch.
If below approach is OK, "[PATCH 10/10] mm, oom: hide mm which is shared
with kthread or global init" will be able to call exit_oom_victim() when
can_oom_reap became false.

----------------------------------------
>From e2b7a68dcb2ed6ace7038be0f6865efcc27b51bd Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 10 Jun 2016 22:59:11 +0900
Subject: [PATCH] mm, oom: stop thawing victim threads

Commit e26796066fdf929c ("oom: make oom_reaper freezable") meant to close
race with oom_killer_disable() by adding set_freezable() to oom_reaper().
But freezable kernel threads are not frozen until freeze_kernel_threads()
is called, which occurs after oom_killer_disable() from freeze_processes()
is called. Therefore, oom_reaper() needs to check right before calling
exit_oom_victim() whether oom_killer_disable() is already in progress.

But doing such check introduces a branch where the OOM reaper fails to
clear TIF_MEMDIE. There is a different approach which does not introduce
such branch.

Currently we thaw only one thread even if that thread's mm is shared by
multiple threads. While that thread might release some memory by e.g.
closing file descriptors after arriving at do_exit(), it won't help much
if such resource is shared by multiple threads. Unless we thaw all threads
using that memory, we can't guarantee that that memory is released before
oom_killer_disable() returns. If the victim was selected by memcg OOM,
memory allocation requests after oom_killer_disabled became true will
likely succeed because OOM situation is local. But if the victim was
selected by global OOM, memory allocation requests after
oom_killer_disabled became true will unlikely succeed; we will see
allocation failures caused by oom_killer_disabled == true.

Then, rather than risking memory allocation failures caused by not waiting
until the victim's memory is released (because the thawed victim might be
able to call exit_oom_victim(tsk) before the OOM reaper starts reaping the
victim's memory, for mmput(mm) which is called from between tsk->mm = NULL
and exit_oom_victim(tsk) in exit_mm() will not wait because there are other
still-frozen threads sharing that memory), waiting until the victim's
memory is reaped by the OOM reaper (by not thawing the victim thread)
will help reducing such risk.

This heuristic depends on an assumption that amount of memory released by
reaping mm is larger than that of by terminating one TIF_MEMDIE thread.
Since we assume that mm_struct is the primary source of memory usage, the
OOM reaper considers that there is no reasonably reclaimable memory after
the OOM reaper reaped the victim's memory. So, it should worth trying.

This patch stops thawing the TIF_MEMDIE thread because the OOM reaper
will reap memory on behalf of that thread, and defers out_of_memory()
returning false after oom_killer_disabled became true till the OOM reaper
clears all pending TIF_MEMDIE. Since the OOM killer does not thaw victim
threads, we won't break oom_killer_disable() semantics even if somebody
else clears TIF_MEMDIE.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/oom_kill.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..11dfb16 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,6 +678,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 	atomic_inc(&tsk->signal->oom_victims);
+#ifndef CONFIG_MMU
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -685,6 +686,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
 	__thaw_task(tsk);
+#endif
 	atomic_inc(&oom_victims);
 }
 
@@ -923,8 +925,24 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned int uninitialized_var(points);
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+#ifdef CONFIG_MMU
+	/*
+	 * If oom_killer_disable() is called, wait for the OOM reaper to reap
+	 * all victim's memory. Since oom_killer_disable() is called between
+	 * after all userspace threads except the one calling
+	 * oom_killer_disable() are frozen and before freezable kernel threads
+	 * are frozen, the OOM reaper is known to be not yet frozen. Also,
+	 * since anyone who calls out_of_memory() while oom_killer_disable() is
+	 * waiting for all victims to terminate is known to be a kernel thread,
+	 * it is better to retry allocation as long as the OOM reaper can find
+	 * reapable memory.
+	 */
+	if (oom_killer_disabled)
+		return atomic_read(&oom_victims);
+#else
 	if (oom_killer_disabled)
 		return false;
+#endif
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
-- 
1.8.3.1
----------------------------------------

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-10 14:23 mm, oom_reaper: How to handle race with oom_killer_disable() ? Tetsuo Handa
@ 2016-06-13 11:19 ` Michal Hocko
  2016-06-21  8:31   ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-13 11:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Fri 10-06-16 23:23:36, Tetsuo Handa wrote:
[...]
>   (1) freeze_processes() starts freezing user space threads.
>   (2) Somebody (maybe a kenrel thread) calls out_of_memory().
>   (3) The OOM killer calls mark_oom_victim() on a user space thread
>       P1 which is already in __refrigerator().
>   (4) oom_killer_disable() sets oom_killer_disabled = true.
>   (5) P1 leaves __refrigerator() and enters do_exit().
>   (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
>       exit_oom_victim(P1).
>   (7) oom_killer_disable() returns while P1 is not yet frozen
>       again (i.e. not yet marked as PF_FROZEN).
>   (8) P1 perform IO/interfere with the freezer.

You are right. I missed that kernel threads are still alive when writing
e26796066fdf929c ("oom: make oom_reaper freezable").

I am trying to remember why we are disabling oom killer before kernel
threads are frozen but not really sure about that right away. I guess it
has something to do with freeze_kernel_threads being called from
different contexts as well so freeze_processes was just more convinient
and was OK for correctness at the time.

> try_to_freeze_tasks(false) from freeze_kernel_threads() will freeze
> P1 again, but it seems to me that freeze_kernel_threads() is not
> always called when freeze_processes() suceeded.
> 
> Therefore, we need to do like
> 
> -	exit_oom_victim(tsk);
> +	mutex_lock(&oom_lock);
> +	if (!oom_killer_disabled)
> +		exit_oom_victim(tsk);
> +	mutex_unlock(&oom_lock);
> 
> in oom_reap_task(), don't we?

I do not like this very much. I would rather make sure that all
freezable kernel threads are frozen when disabling the oom killer.

[...]

> But we might be able to do like below patch rather than above patch.
> If below approach is OK, "[PATCH 10/10] mm, oom: hide mm which is shared
> with kthread or global init" will be able to call exit_oom_victim() when
> can_oom_reap became false.

I believe this is not really needed. I will follow up on the 10/10
later.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-13 11:19 ` Michal Hocko
@ 2016-06-21  8:31   ` Michal Hocko
  2016-06-21 11:03     ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-21  8:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Mon 13-06-16 13:19:43, Michal Hocko wrote:
[...]
> I am trying to remember why we are disabling oom killer before kernel
> threads are frozen but not really sure about that right away.

OK, I guess I remember now. Say that a task would depend on a freezable
kernel thread to get to do_exit (stuck in wait_event etc...). We would
simply get stuck in oom_killer_disable for ever. So we need to address
it a different way.

One way would be what you are proposing but I guess it would be more
systematic to never call exit_oom_victim on a remote task.  After [1] we
have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
is set. It is more code than your patch so I can see a reason to go with
yours if the following one seems too large or ugly.

[1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org

What do you think about the following?
---
>From f7ffecf944648f16d98fc70275041433ed68f7e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 20 Jun 2016 17:28:18 +0200
Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.

This patch effectivelly reverts 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper"). We have a more
efficient way to achieve the same. We can rely on MMF_REAPED because
"mm, oom: hide mm which is shared with kthread or global init" made sure
that we skip tasks with MMF_REAPED even when they have TIF_MEMDIE.

The only missing part is that we have to handle __oom_reap_task failure
gracefully. Currently we mark the mm by MMF_OOM_NOT_REAPABLE to later
set MMF_REAPED should we fail __oom_reap_task again. This relies on
clearing TIF_MEMDIE to select that task again. We can achieve a similar
thing by requeuing the task to the oom_reaper_list. Introduce
oom_reaper_add_tail to add the task to the tail of the oom_reaper_list.
While we are at it reuse it for wake_oom_reaper as well because the
queueing should be FIFO.

In the result only the oom victim will be responsible to clear its
TIF_MEMDIE flag so the above race is not possible anymore. oom_reaper
will be allowed to play only with MMF_* flags so the responsibility
model will be more clear.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 86 ++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..dbc62a69fee8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -558,6 +558,51 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	return ret;
 }
 
+static void oom_reaper_add_tail(struct task_struct *tsk)
+{
+	struct task_struct *q;
+
+	spin_lock(&oom_reaper_lock);
+	if (!oom_reaper_list) {
+		oom_reaper_list = tsk;
+		goto unlock;
+	}
+
+	for (q = oom_reaper_list; q->oom_reaper_list; q = q->oom_reaper_list)
+		;
+	q->oom_reaper_list = tsk;
+
+unlock:
+	spin_unlock(&oom_reaper_lock);
+}
+
+static inline bool requeue_oom_victim(struct task_struct *tsk)
+{
+	struct task_struct *p;
+	bool ret;
+
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
+		return false;
+
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags))
+		set_bit(MMF_OOM_REAPED, &p->mm->flags);
+	else
+		ret = true;
+
+	task_unlock(p);
+
+	if (ret)
+		oom_reaper_add_tail(tsk);
+	return ret;
+}
+
 #define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
@@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	tsk->oom_reaper_list = NULL;
 
+	if (attempts > MAX_OOM_REAP_RETRIES) {
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		debug_show_all_locks();
+
+		if (requeue_oom_victim(tsk)) {
+			cond_resched();
+			return;
 		}
 
-		debug_show_all_locks();
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
 	}
 
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
-	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -637,11 +665,7 @@ void wake_oom_reaper(struct task_struct *tsk)
 		return;
 
 	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	oom_reaper_add_tail(tsk);
 	wake_up(&oom_reaper_wait);
 }
 
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21  8:31   ` Michal Hocko
@ 2016-06-21 11:03     ` Tetsuo Handa
  2016-06-21 11:46       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-06-21 11:03 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

Michal Hocko wrote:
> On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> [...]
> > I am trying to remember why we are disabling oom killer before kernel
> > threads are frozen but not really sure about that right away.
> 
> OK, I guess I remember now. Say that a task would depend on a freezable
> kernel thread to get to do_exit (stuck in wait_event etc...). We would
> simply get stuck in oom_killer_disable for ever. So we need to address
> it a different way.
> 
> One way would be what you are proposing but I guess it would be more
> systematic to never call exit_oom_victim on a remote task.  After [1] we
> have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> is set. It is more code than your patch so I can see a reason to go with
> yours if the following one seems too large or ugly.
> 
> [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> 
> What do you think about the following?

I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is racy.

> @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
>  		schedule_timeout_idle(HZ/10);
>  
> -	if (attempts > MAX_OOM_REAP_RETRIES) {
> -		struct task_struct *p;
> +	tsk->oom_reaper_list = NULL;
>  
> +	if (attempts > MAX_OOM_REAP_RETRIES) {

attempts > MAX_OOM_REAP_RETRIES would mean that down_read_trylock()
continuously failed. But it does not guarantee that the offending task
shall not call up_write(&mm->mmap_sem) and arrives at mmput() from exit_mm()
(as well as other threads which are blocked at down_read(&mm->mmap_sem) in
exit_mm() by the offending task arrive at mmput() from exit_mm()) when the
OOM reaper was preempted at this point.

Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL and
the OOM reaper could fail to set MMF_OOM_REAPED (and find_lock_task_mm() in
oom_scan_process_thread() could return NULL and the OOM killer could fail to
select next OOM victim as well) when __mmput() got stuck.

So, from the point of view of correctness, there remains an unhandled race
window as long as you depend on find_lock_task_mm() not returning NULL.
You will again ask "does it really matter/occur", and I can't make progress.

>  		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  				task_pid_nr(tsk), tsk->comm);
>  

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21 11:03     ` Tetsuo Handa
@ 2016-06-21 11:46       ` Michal Hocko
  2016-06-21 13:27         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-21 11:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Tue 21-06-16 20:03:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> > [...]
> > > I am trying to remember why we are disabling oom killer before kernel
> > > threads are frozen but not really sure about that right away.
> > 
> > OK, I guess I remember now. Say that a task would depend on a freezable
> > kernel thread to get to do_exit (stuck in wait_event etc...). We would
> > simply get stuck in oom_killer_disable for ever. So we need to address
> > it a different way.
> > 
> > One way would be what you are proposing but I guess it would be more
> > systematic to never call exit_oom_victim on a remote task.  After [1] we
> > have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> > is set. It is more code than your patch so I can see a reason to go with
> > yours if the following one seems too large or ugly.
> > 
> > [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> > 
> > What do you think about the following?
> 
> I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is racy.
> 
> > @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> >  		schedule_timeout_idle(HZ/10);
> >  
> > -	if (attempts > MAX_OOM_REAP_RETRIES) {
> > -		struct task_struct *p;
> > +	tsk->oom_reaper_list = NULL;
> >  
> > +	if (attempts > MAX_OOM_REAP_RETRIES) {
> 
> attempts > MAX_OOM_REAP_RETRIES would mean that down_read_trylock()
> continuously failed. But it does not guarantee that the offending task
> shall not call up_write(&mm->mmap_sem) and arrives at mmput() from exit_mm()
> (as well as other threads which are blocked at down_read(&mm->mmap_sem) in
> exit_mm() by the offending task arrive at mmput() from exit_mm()) when the
> OOM reaper was preempted at this point.
> 
> Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL and
> the OOM reaper could fail to set MMF_OOM_REAPED (and find_lock_task_mm() in
> oom_scan_process_thread() could return NULL and the OOM killer could fail to
> select next OOM victim as well) when __mmput() got stuck.

Fair enough. As this would break no-lockup requirement we cannot go that
way. Let me think about it more.
 
> So, from the point of view of correctness, there remains an unhandled race
> window as long as you depend on find_lock_task_mm() not returning NULL.
> You will again ask "does it really matter/occur", and I can't make progress.

Sigh...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21 11:46       ` Michal Hocko
@ 2016-06-21 13:27         ` Michal Hocko
  2016-06-21 15:32           ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-21 13:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Tue 21-06-16 13:46:43, Michal Hocko wrote:
> On Tue 21-06-16 20:03:17, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> > > [...]
> > > > I am trying to remember why we are disabling oom killer before kernel
> > > > threads are frozen but not really sure about that right away.
> > > 
> > > OK, I guess I remember now. Say that a task would depend on a freezable
> > > kernel thread to get to do_exit (stuck in wait_event etc...). We would
> > > simply get stuck in oom_killer_disable for ever. So we need to address
> > > it a different way.
> > > 
> > > One way would be what you are proposing but I guess it would be more
> > > systematic to never call exit_oom_victim on a remote task.  After [1] we
> > > have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> > > is set. It is more code than your patch so I can see a reason to go with
> > > yours if the following one seems too large or ugly.
> > > 
> > > [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> > > 
> > > What do you think about the following?
> > 
> > I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is racy.
> > 
> > > @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> > >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> > >  		schedule_timeout_idle(HZ/10);
> > >  
> > > -	if (attempts > MAX_OOM_REAP_RETRIES) {
> > > -		struct task_struct *p;
> > > +	tsk->oom_reaper_list = NULL;
> > >  
> > > +	if (attempts > MAX_OOM_REAP_RETRIES) {
> > 
> > attempts > MAX_OOM_REAP_RETRIES would mean that down_read_trylock()
> > continuously failed. But it does not guarantee that the offending task
> > shall not call up_write(&mm->mmap_sem) and arrives at mmput() from exit_mm()
> > (as well as other threads which are blocked at down_read(&mm->mmap_sem) in
> > exit_mm() by the offending task arrive at mmput() from exit_mm()) when the
> > OOM reaper was preempted at this point.
> > 
> > Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL and
> > the OOM reaper could fail to set MMF_OOM_REAPED (and find_lock_task_mm() in
> > oom_scan_process_thread() could return NULL and the OOM killer could fail to
> > select next OOM victim as well) when __mmput() got stuck.
> 
> Fair enough. As this would break no-lockup requirement we cannot go that
> way. Let me think about it more.

Hmm, what about the following instead. It is rather a workaround than a
full flaged fix but it seems much more easier and shouldn't introduce
new issues.
---
>From 86bf010d2a6086491bb356494fab0e0fca80dee9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 20 Jun 2016 17:28:18 +0200
Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.

This situation is unfortunate. We cannot move oom_killer_disable after
all the freezable kernel threads are frozen because the oom victim might
depend on some of those kthreads to make a forward progress to exit so
we could deadlock. It is also far from trivial to teach the oom_reaper
to not call exit_oom_victim() because then we would lose a guarantee of
the OOM killer and oom_killer_disable forward progress because
exit_mm->mmput might block and never call exit_oom_victim.

It seems the easiest way forward is to workaround this race by calling
try_to_freeze_tasks again after oom_killer_disable. This will make sure
that all the tasks are frozen or it bails out.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/power/process.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index df058bed53ce..0c2ee9761d57 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -146,6 +146,18 @@ int freeze_processes(void)
 	if (!error && !oom_killer_disable())
 		error = -EBUSY;
 
+	/*
+	 * There is a hard to fix race between oom_reaper kernel thread
+	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
+	 * before the victim reaches exit_mm so try to freeze all the tasks
+	 * again and catch such a left over task.
+	 */
+	if (!error) {
+		pr_info("Double checking all user space processes after OOM killer disable... ");
+		error = try_to_freeze_tasks(true);
+		pr_cont("\n");
+	}
+
 	if (error)
 		thaw_processes();
 	return error;
-- 
2.8.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21 13:27         ` Michal Hocko
@ 2016-06-21 15:32           ` Tetsuo Handa
  2016-06-21 17:46             ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-06-21 15:32 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

Michal Hocko wrote:
> On Tue 21-06-16 13:46:43, Michal Hocko wrote:
> > On Tue 21-06-16 20:03:17, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Mon 13-06-16 13:19:43, Michal Hocko wrote:
> > > > [...]
> > > > > I am trying to remember why we are disabling oom killer before kernel
> > > > > threads are frozen but not really sure about that right away.
> > > > 
> > > > OK, I guess I remember now. Say that a task would depend on a freezable
> > > > kernel thread to get to do_exit (stuck in wait_event etc...). We would
> > > > simply get stuck in oom_killer_disable for ever. So we need to address
> > > > it a different way.
> > > > 
> > > > One way would be what you are proposing but I guess it would be more
> > > > systematic to never call exit_oom_victim on a remote task.  After [1] we
> > > > have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE
> > > > is set. It is more code than your patch so I can see a reason to go with
> > > > yours if the following one seems too large or ugly.
> > > > 
> > > > [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@kernel.org
> > > > 
> > > > What do you think about the following?
> > > 
> > > I'm OK with not clearing TIF_MEMDIE from a remote task. But this patch is racy.
> > > 
> > > > @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
> > > >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> > > >  		schedule_timeout_idle(HZ/10);
> > > >  
> > > > -	if (attempts > MAX_OOM_REAP_RETRIES) {
> > > > -		struct task_struct *p;
> > > > +	tsk->oom_reaper_list = NULL;
> > > >  
> > > > +	if (attempts > MAX_OOM_REAP_RETRIES) {
> > > 
> > > attempts > MAX_OOM_REAP_RETRIES would mean that down_read_trylock()
> > > continuously failed. But it does not guarantee that the offending task
> > > shall not call up_write(&mm->mmap_sem) and arrives at mmput() from exit_mm()
> > > (as well as other threads which are blocked at down_read(&mm->mmap_sem) in
> > > exit_mm() by the offending task arrive at mmput() from exit_mm()) when the
> > > OOM reaper was preempted at this point.
> > > 
> > > Therefore, find_lock_task_mm() in requeue_oom_victim() could return NULL and
> > > the OOM reaper could fail to set MMF_OOM_REAPED (and find_lock_task_mm() in
> > > oom_scan_process_thread() could return NULL and the OOM killer could fail to
> > > select next OOM victim as well) when __mmput() got stuck.
> > 
> > Fair enough. As this would break no-lockup requirement we cannot go that
> > way. Let me think about it more.
> 
> Hmm, what about the following instead. It is rather a workaround than a
> full flaged fix but it seems much more easier and shouldn't introduce
> new issues.

Yes, I think that will work. But I think below patch (marking signal_struct
to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
current linux.git will implement no-lockup requirement. No race is possible unlike
"[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".

 include/linux/oom.h   |  1 +
 include/linux/sched.h |  2 ++
 mm/memcontrol.c       |  3 ++-
 mm/oom_kill.c         | 60 ++++++++++++++++++++++++++++++---------------------
 4 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8346952..f072c6c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -69,6 +69,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 
 extern void mark_oom_victim(struct task_struct *tsk);
 
+extern bool task_is_reapable(struct task_struct *tsk);
 #ifdef CONFIG_MMU
 extern void try_oom_reaper(struct task_struct *tsk);
 #else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..9248f90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -799,6 +799,7 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	bool oom_ignore_me;
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
@@ -1545,6 +1546,7 @@ struct task_struct {
 	/* unserialized, strictly 'current' */
 	unsigned in_execve:1; /* bit to tell LSMs we're in execve */
 	unsigned in_iowait:1;
+	unsigned oom_shortcut_done:1;
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #ifndef CONFIG_SLOB
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e7440..af162f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1275,7 +1275,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if (task_is_reapable(current) && !current->oom_shortcut_done) {
+		current->oom_shortcut_done = true;
 		mark_oom_victim(current);
 		try_oom_reaper(current);
 		goto unlock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index acbc432..e20d889 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -149,7 +149,7 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
-	return false;
+	return p->signal->oom_ignore_me;
 }
 
 /**
@@ -555,15 +555,15 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * Ignore TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
 	 * for the oom victim right now because it cannot release its memory
 	 * itself nor by the oom reaper.
 	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
+	tsk->signal->oom_ignore_me = true;
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by try_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -589,7 +589,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void try_oom_reaper(struct task_struct *tsk)
 {
 	if (!oom_reaper_th)
 		return;
@@ -610,13 +610,13 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-void try_oom_reaper(struct task_struct *tsk)
+bool task_is_reapable(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->mm;
 	struct task_struct *p;
 
 	if (!mm)
-		return;
+		return false;
 
 	/*
 	 * There might be other threads/processes which are either not
@@ -639,12 +639,11 @@ void try_oom_reaper(struct task_struct *tsk)
 
 			/* Give up */
 			rcu_read_unlock();
-			return;
+			return false;
 		}
 		rcu_read_unlock();
 	}
-
-	wake_oom_reaper(tsk);
+	return true;
 }
 
 static int __init oom_init(void)
@@ -659,8 +658,10 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *tsk)
+bool task_is_reapable(struct task_struct *tsk)
 {
+	return tsk->mm &&
+		(fatal_signal_pending(tsk) || task_will_free_mem(tsk));
 }
 #endif
 
@@ -753,20 +754,28 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+#ifdef CONFIG_MMU
+	if (task_is_reapable(p)) {
 		mark_oom_victim(p);
 		try_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
 	}
+#else
+	if (p->mm && task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+#endif
 	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
@@ -846,21 +855,22 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		if (same_thread_group(p, victim))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+#ifdef CONFIG_MMU
+	p = find_lock_task_mm(victim);
+	if (p && task_is_reapable(p))
+		try_oom_reaper(victim);
+	else
+		victim->signal->oom_ignore_me = true;
+	if (p)
+		task_unlock(p);
+#endif
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -939,8 +949,8 @@ bool out_of_memory(struct oom_control *oc)
 	 * But don't select if current has already released its mm and cleared
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (!current->oom_shortcut_done && task_is_reapable(current)) {
+		current->oom_shortcut_done = true;
 		mark_oom_victim(current);
 		try_oom_reaper(current);
 		return true;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21 15:32           ` Tetsuo Handa
@ 2016-06-21 17:46             ` Michal Hocko
  2016-06-21 21:47               ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-21 17:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Hmm, what about the following instead. It is rather a workaround than a
> > full flaged fix but it seems much more easier and shouldn't introduce
> > new issues.
> 
> Yes, I think that will work. But I think below patch (marking signal_struct
> to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> current linux.git will implement no-lockup requirement. No race is possible unlike
> "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".

Not really. Because without the exit_oom_victim from oom_reaper you have
no guarantee that the oom_killer_disable will ever return. I have
mentioned that in the changelog. There is simply no guarantee the oom
victim will ever reach exit_mm->exit_oom_victim.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21 17:46             ` Michal Hocko
@ 2016-06-21 21:47               ` Tetsuo Handa
  2016-06-22  6:40                 ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-06-21 21:47 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

Michal Hocko wrote:
> On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Hmm, what about the following instead. It is rather a workaround than a
> > > full flaged fix but it seems much more easier and shouldn't introduce
> > > new issues.
> > 
> > Yes, I think that will work. But I think below patch (marking signal_struct
> > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > current linux.git will implement no-lockup requirement. No race is possible unlike
> > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> 
> Not really. Because without the exit_oom_victim from oom_reaper you have
> no guarantee that the oom_killer_disable will ever return. I have
> mentioned that in the changelog. There is simply no guarantee the oom
> victim will ever reach exit_mm->exit_oom_victim.

Why? Since any allocation after setting oom_killer_disabled = true will be
forced to fail, nobody will be blocked on waiting for memory allocation. Thus,
the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't it?

The only possibility that the TIF_MEMDIE tasks won't reach exit_mm->exit_oom_victim
is __GFP_NOFAIL allocations failing to make forward progress even after
ALLOC_NO_WATERMARKS is used. But that is a different problem which I think
we can call panic() when __GFP_NOFAIL allocations failed after setting
oom_killer_disabled = true.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-21 21:47               ` Tetsuo Handa
@ 2016-06-22  6:40                 ` Michal Hocko
  2016-06-22  6:50                   ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-22  6:40 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Wed 22-06-16 06:47:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > Hmm, what about the following instead. It is rather a workaround than a
> > > > full flaged fix but it seems much more easier and shouldn't introduce
> > > > new issues.
> > > 
> > > Yes, I think that will work. But I think below patch (marking signal_struct
> > > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > > current linux.git will implement no-lockup requirement. No race is possible unlike
> > > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> > 
> > Not really. Because without the exit_oom_victim from oom_reaper you have
> > no guarantee that the oom_killer_disable will ever return. I have
> > mentioned that in the changelog. There is simply no guarantee the oom
> > victim will ever reach exit_mm->exit_oom_victim.
> 
> Why? Since any allocation after setting oom_killer_disabled = true will be
> forced to fail, nobody will be blocked on waiting for memory allocation. Thus,
> the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't it?

What if it gets blocked waiting for an operation which cannot make any
forward progress because it cannot proceed with an allocation (e.g.
an open coded allocation retry loop - not that uncommon when sending
a bio)? I mean if we want to guarantee a forward progress then there has
to be something to clear the flag no matter in what state the oom victim
is or give up on oom_killer_disable.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-22  6:40                 ` Michal Hocko
@ 2016-06-22  6:50                   ` Michal Hocko
  2016-06-22 10:57                     ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-22  6:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Wed 22-06-16 08:40:15, Michal Hocko wrote:
> On Wed 22-06-16 06:47:48, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > [...]
> > > > > Hmm, what about the following instead. It is rather a workaround than a
> > > > > full flaged fix but it seems much more easier and shouldn't introduce
> > > > > new issues.
> > > > 
> > > > Yes, I think that will work. But I think below patch (marking signal_struct
> > > > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > > > current linux.git will implement no-lockup requirement. No race is possible unlike
> > > > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> > > 
> > > Not really. Because without the exit_oom_victim from oom_reaper you have
> > > no guarantee that the oom_killer_disable will ever return. I have
> > > mentioned that in the changelog. There is simply no guarantee the oom
> > > victim will ever reach exit_mm->exit_oom_victim.
> > 
> > Why? Since any allocation after setting oom_killer_disabled = true will be
> > forced to fail, nobody will be blocked on waiting for memory allocation. Thus,
> > the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't it?
> 
> What if it gets blocked waiting for an operation which cannot make any
> forward progress because it cannot proceed with an allocation (e.g.
> an open coded allocation retry loop - not that uncommon when sending
> a bio)? I mean if we want to guarantee a forward progress then there has
> to be something to clear the flag no matter in what state the oom victim
> is or give up on oom_killer_disable.

That being said I guess the patch to try_to_freeze_tasks after
oom_killer_disable should be simple enough to go for now and stable
trees and we can come up with something less hackish later. I do not
like the fact that oom_killer_disable doesn't act as a full "barrier"
anymore.

What do you think?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-22  6:50                   ` Michal Hocko
@ 2016-06-22 10:57                     ` Tetsuo Handa
  2016-06-22 12:08                       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2016-06-22 10:57 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

Michal Hocko wrote:
> On Wed 22-06-16 08:40:15, Michal Hocko wrote:
> > On Wed 22-06-16 06:47:48, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Wed 22-06-16 00:32:29, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > [...]
> > > > > > Hmm, what about the following instead. It is rather a workaround than a
> > > > > > full flaged fix but it seems much more easier and shouldn't introduce
> > > > > > new issues.
> > > > > 
> > > > > Yes, I think that will work. But I think below patch (marking signal_struct
> > > > > to ignore TIF_MEMDIE instead of clearing TIF_MEMDIE from task_struct) on top of
> > > > > current linux.git will implement no-lockup requirement. No race is possible unlike
> > > > > "[PATCH 10/10] mm, oom: hide mm which is shared with kthread or global init".
> > > > 
> > > > Not really. Because without the exit_oom_victim from oom_reaper you have
> > > > no guarantee that the oom_killer_disable will ever return. I have
> > > > mentioned that in the changelog. There is simply no guarantee the oom
> > > > victim will ever reach exit_mm->exit_oom_victim.
> > > 
> > > Why? Since any allocation after setting oom_killer_disabled = true will be
> > > forced to fail, nobody will be blocked on waiting for memory allocation. Thus,
> > > the TIF_MEMDIE tasks will eventually reach exit_mm->exit_oom_victim, won't it?
> > 
> > What if it gets blocked waiting for an operation which cannot make any
> > forward progress because it cannot proceed with an allocation (e.g.
> > an open coded allocation retry loop - not that uncommon when sending
> > a bio)? I mean if we want to guarantee a forward progress then there has
> > to be something to clear the flag no matter in what state the oom victim
> > is or give up on oom_killer_disable.

That sounds as if CONFIG_MMU=n kernels do OOM livelock at __mmput() regardless
of oom_killer_disabled.

> 
> That being said I guess the patch to try_to_freeze_tasks after
> oom_killer_disable should be simple enough to go for now and stable
> trees and we can come up with something less hackish later. I do not
> like the fact that oom_killer_disable doesn't act as a full "barrier"
> anymore.
> 
> What do you think?

I'm OK with calling try_to_freeze_tasks(true) again for Linux 4.6 and 4.7 kernels.

But if free memory is little such that oom_killer_disable() can not expect TIF_MEMDIE
threads to clear TIF_MEMDIE by themselves (and therefore has to depend on the OOM
reaper to clear TIF_MEMDIE on behalf of them after the OOM reaper reaped some memory),
subsequent operations would be as well blocked waiting for an operation which cannot
make any forward progress because it cannot proceed with an allocation. Then,
oom_killer_disable() returns false after some timeout (i.e. "do not try to suspend
when the system is almost OOM") will be a safer reaction.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-22 10:57                     ` Tetsuo Handa
@ 2016-06-22 12:08                       ` Michal Hocko
  2016-06-22 12:15                         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2016-06-22 12:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Wed 22-06-16 19:57:17, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > That being said I guess the patch to try_to_freeze_tasks after
> > oom_killer_disable should be simple enough to go for now and stable
> > trees and we can come up with something less hackish later. I do not
> > like the fact that oom_killer_disable doesn't act as a full "barrier"
> > anymore.
> > 
> > What do you think?
> 
> I'm OK with calling try_to_freeze_tasks(true) again for Linux 4.6 and 4.7 kernels.

OK, I will resend the patch CC Rafael and stable.
 
> But if free memory is little such that oom_killer_disable() can not expect TIF_MEMDIE
> threads to clear TIF_MEMDIE by themselves (and therefore has to depend on the OOM
> reaper to clear TIF_MEMDIE on behalf of them after the OOM reaper reaped some memory),
> subsequent operations would be as well blocked waiting for an operation which cannot
> make any forward progress because it cannot proceed with an allocation. Then,
> oom_killer_disable() returns false after some timeout (i.e. "do not try to suspend
> when the system is almost OOM") will be a safer reaction.

Yes that is exactly what I meant by "oom_killer_disable has to give up"
alternative. pm suspend already has a notion of timeout for back off
and oom_killer_disable can use wait_even_timeout. But let's do that
separately.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: mm, oom_reaper: How to handle race with oom_killer_disable() ?
  2016-06-22 12:08                       ` Michal Hocko
@ 2016-06-22 12:15                         ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2016-06-22 12:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, rientjes, oleg, vdavydov, mgorman, hughd, riel, akpm,
	linux-kernel

On Wed 22-06-16 14:08:43, Michal Hocko wrote:
> On Wed 22-06-16 19:57:17, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > That being said I guess the patch to try_to_freeze_tasks after
> > > oom_killer_disable should be simple enough to go for now and stable
> > > trees and we can come up with something less hackish later. I do not
> > > like the fact that oom_killer_disable doesn't act as a full "barrier"
> > > anymore.
> > > 
> > > What do you think?
> > 
> > I'm OK with calling try_to_freeze_tasks(true) again for Linux 4.6 and 4.7 kernels.
> 
> OK, I will resend the patch CC Rafael and stable.

Ohh, I've just realized that 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper") went in in this merge
window. So I've asked to push it to the next 4.7 rc.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-06-22 12:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-10 14:23 mm, oom_reaper: How to handle race with oom_killer_disable() ? Tetsuo Handa
2016-06-13 11:19 ` Michal Hocko
2016-06-21  8:31   ` Michal Hocko
2016-06-21 11:03     ` Tetsuo Handa
2016-06-21 11:46       ` Michal Hocko
2016-06-21 13:27         ` Michal Hocko
2016-06-21 15:32           ` Tetsuo Handa
2016-06-21 17:46             ` Michal Hocko
2016-06-21 21:47               ` Tetsuo Handa
2016-06-22  6:40                 ` Michal Hocko
2016-06-22  6:50                   ` Michal Hocko
2016-06-22 10:57                     ` Tetsuo Handa
2016-06-22 12:08                       ` Michal Hocko
2016-06-22 12:15                         ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).