* [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
@ 2016-06-24 11:02 Tetsuo Handa
  2016-06-24 12:39 ` Michal Hocko
  2016-06-24 21:56 ` Oleg Nesterov
  0 siblings, 2 replies; 34+ messages in thread
From: Tetsuo Handa @ 2016-06-24 11:02 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, Tetsuo Handa, Michal Hocko, Oleg Nesterov,
	Vladimir Davydov, David Rientjes

Currently, the OOM reaper calls exit_oom_victim() on a remote TIF_MEMDIE
thread after an OOM reap attempt has been made. This behavior is intended
to allow oom_scan_process_thread() to select the next OOM victim by
making atomic_read(&task->signal->oom_victims) == 0.

But since threads can be blocked for an unbounded period at __mmput() from
mmput() from exit_mm() from do_exit(), we can't risk the OOM reaper
being blocked for an unbounded period waiting for TIF_MEMDIE threads.
Therefore, when we hit a situation where a TIF_MEMDIE thread which is
the only thread of its thread group reached the tsk->mm = NULL line in
exit_mm() from do_exit() before __oom_reap_task() found the mm via
find_lock_task_mm(), oom_reap_task() does not wait for the TIF_MEMDIE
thread to return from __mmput() and instead calls exit_oom_victim().

Patch "mm, oom: hide mm which is shared with kthread or global init"
tried to avoid the OOM livelock by setting MMF_OOM_REAPED, but it is racy
because setting MMF_OOM_REAPED does not help when find_lock_task_mm()
in oom_scan_process_thread() fails.

It is possible (though unlikely) that a !can_oom_reap TIF_MEMDIE thread
becomes the only user of its mm (i.e. mm->mm_users drops to 1) and is
later blocked for an unbounded period at __mmput() from mmput() from
exit_mm() from do_exit(), as in the following sequence:

  (1) First round of OOM killer invocation starts.
  (2) select_bad_process() chooses P1 as an OOM victim because
      oom_scan_process_thread() does not find existing victims.
  (3) oom_kill_process() sets TIF_MEMDIE on P1, but does not put P1 under
      the OOM reaper's supervision due to (p->flags & PF_KTHREAD) being
      true, and instead sets MMF_OOM_REAPED on P1's mm.
  (4) First round of OOM killer invocation finishes.
  (5) P1 is unable to arrive at do_exit() due to being blocked at an
      unkillable event waiting for somebody else's memory allocation.
  (6) Second round of OOM killer invocation starts.
  (7) select_bad_process() chooses P2 as an OOM victim because
      oom_scan_process_thread() finds P1's mm with MMF_OOM_REAPED set.
  (8) oom_kill_process() sets TIF_MEMDIE on P2 via mark_oom_victim(),
      and puts P2 under the OOM reaper's supervision due to
      (p->flags & PF_KTHREAD) being false.
  (9) Second round of OOM killer invocation finishes.
  (10) The OOM reaper reaps P2's mm, and sets MMF_OOM_REAPED to
       P2's mm, and clears TIF_MEMDIE from P2.
  (11) Regarding P1's mm, (p->flags & PF_KTHREAD) becomes false because
       somebody else's memory allocation succeeds and unuse_mm(P1->mm)
       is called. At this point P1 becomes the only user of P1->mm.
  (12) P1 arrives at do_exit() due to no longer being blocked at the
       unkillable event waiting for somebody else's memory allocation.
  (13) P1 reaches P1->mm = NULL line in exit_mm() from do_exit().
  (14) P1 is blocked at __mmput().
  (15) Third round of OOM killer invocation starts.
  (16) select_bad_process() does not choose a new OOM victim because
       oom_scan_process_thread() fails to find P1's mm while
       P1->signal->oom_victims > 0.
  (17) Third round of OOM killer invocation finishes.
  (18) OOM livelock happens because nobody will clear TIF_MEMDIE from
       P1 (and decrement P1->signal->oom_victims) while P1 is blocked
       at __mmput().


Thus, we must not depend on find_lock_task_mm() not returning NULL.
We must use a raceless method which allows oom_scan_process_thread()
to select the next OOM victim no matter when the TIF_MEMDIE thread passed
the tsk->mm = NULL line and is blocked for an unbounded period at __mmput()
from mmput() from exit_mm() from do_exit().

In order to make atomic_read(&task->signal->oom_victims) == 0,
oom_kill_process() needs to call exit_oom_victim() when we don't put
the TIF_MEMDIE thread under the OOM reaper's supervision, just as
oom_reap_task() needs to call exit_oom_victim() when the OOM reaper
failed to reap memory due to find_lock_task_mm() in __oom_reap_task()
returning NULL.

On the other hand, calling exit_oom_victim() on a remote TIF_MEMDIE thread
is racy with the oom_killer_disable() synchronization. It would be possible
to call try_to_freeze_tasks(true) after returning from oom_killer_disable(),
but that is not optimal. Also, as patch "mm, oom: task_will_free_mem
should skip oom_reaped tasks" showed, calling exit_oom_victim() on a remote
TIF_MEMDIE thread tends to increase the possibility of taking the
task_will_free_mem() shortcut in out_of_memory(). Also, calling
exit_oom_victim(current) from oom_kill_process() prevents current from
using ALLOC_NO_WATERMARKS via TIF_MEMDIE.

Therefore, we should consider using a raceless method which does not
depend on atomic_read(&task->signal->oom_victims) == 0 and
atomic_read(&oom_victims) == 0, in order to respectively allow
oom_scan_process_thread() to select the next OOM victim and allow
the oom_killer_disable() synchronization to act as a full "barrier".

This patch introduces a "struct signal_struct"->oom_ignore_victims flag
and sets that flag instead of calling exit_oom_victim().
oom_killer_disable() can use wait_event_timeout() in case something
goes terribly wrong.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 30 +++++++++++-------------------
 2 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1b4475..c5281da 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -801,6 +801,7 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	bool oom_ignore_victims;        /* Ignore oom_victims value */
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f74..aa030c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -284,21 +284,12 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
-	 * any memory is quite low.
+	 * the task has oom_ignore_victims which is set when we can't wait for
+	 * exit_oom_victim().
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
-		enum oom_scan_t ret = OOM_SCAN_ABORT;
-
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
-
-		return ret;
-	}
+	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
+		return task->signal->oom_ignore_victims ?
+			OOM_SCAN_CONTINUE : OOM_SCAN_ABORT;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -593,13 +584,13 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
+	 * Let oom_scan_process_thread() ignore this task because the task
+	 * shouldn't be sitting on a reasonably reclaimable memory anymore or
+	 * it is not a good candidate for the oom victim right now because it
+	 * cannot release its memory itself nor by the oom reaper.
 	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
+	tsk->signal->oom_ignore_victims = true;
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -930,6 +921,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 */
 			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
+			victim->signal->oom_ignore_victims = true;
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-24 11:02 [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE Tetsuo Handa
@ 2016-06-24 12:39 ` Michal Hocko
  2016-06-24 15:54   ` Tetsuo Handa
  2016-06-24 21:56 ` Oleg Nesterov
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-24 12:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Oleg Nesterov, Vladimir Davydov, David Rientjes

On Fri 24-06-16 20:02:01, Tetsuo Handa wrote:
> Currently, the OOM reaper calls exit_oom_victim() on remote TIF_MEMDIE
> thread after an OOM reap attempt was made. This behavior is intended
> for allowing oom_scan_process_thread() to select next OOM victim by
> making atomic_read(&task->signal->oom_victims) == 0.
> 
> But since threads can be blocked for unbounded period at __mmput() from
> mmput() from exit_mm() from do_exit(), we can't risk the OOM reaper
> being blocked for unbounded period waiting for TIF_MEMDIE threads.
> Therefore, when we hit a situation that a TIF_MEMDIE thread which is
> the only thread of that thread group reached tsk->mm = NULL line in
> exit_mm() from do_exit() before __oom_reap_task() finds a mm via
> find_lock_task_mm(), oom_reap_task() does not wait for the TIF_MEMDIE
> thread to return from __mmput() and instead calls exit_oom_victim().
> 
> Patch "mm, oom: hide mm which is shared with kthread or global init"
> tried to avoid OOM livelock by setting MMF_OOM_REAPED, but it is racy
> because setting MMF_OOM_REAPED will not help when find_lock_task_mm()
> in oom_scan_process_thread() failed.

I haven't thought that through yet (I will wait for the monday fresh
brain) but wouldn't the following be sufficient?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..72360d7284a6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -295,7 +295,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
 				ret = OOM_SCAN_CONTINUE;
 			task_unlock(p);
-		}
+		} else if (task->state == EXIT_ZOMBIE)
+			ret = OOM_SCAN_CONTINUE;
 
 		return ret;
 	}
@@ -592,14 +593,7 @@ static void oom_reap_task(struct task_struct *tsk)
 		debug_show_all_locks();
 	}
 
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-24 12:39 ` Michal Hocko
@ 2016-06-24 15:54   ` Tetsuo Handa
  2016-06-24 22:42     ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: Tetsuo Handa @ 2016-06-24 15:54 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, oleg, vdavydov, rientjes

Michal Hocko wrote:
> On Fri 24-06-16 20:02:01, Tetsuo Handa wrote:
> > Currently, the OOM reaper calls exit_oom_victim() on remote TIF_MEMDIE
> > thread after an OOM reap attempt was made. This behavior is intended
> > for allowing oom_scan_process_thread() to select next OOM victim by
> > making atomic_read(&task->signal->oom_victims) == 0.
> > 
> > But since threads can be blocked for unbounded period at __mmput() from
> > mmput() from exit_mm() from do_exit(), we can't risk the OOM reaper
> > being blocked for unbounded period waiting for TIF_MEMDIE threads.
> > Therefore, when we hit a situation that a TIF_MEMDIE thread which is
> > the only thread of that thread group reached tsk->mm = NULL line in
> > exit_mm() from do_exit() before __oom_reap_task() finds a mm via
> > find_lock_task_mm(), oom_reap_task() does not wait for the TIF_MEMDIE
> > thread to return from __mmput() and instead calls exit_oom_victim().
> > 
> > Patch "mm, oom: hide mm which is shared with kthread or global init"
> > tried to avoid OOM livelock by setting MMF_OOM_REAPED, but it is racy
> > because setting MMF_OOM_REAPED will not help when find_lock_task_mm()
> > in oom_scan_process_thread() failed.
> 
> I haven't thought that through yet (I will wait for the monday fresh
> brain) but wouldn't the following be sufficient?
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4c21f744daa6..72360d7284a6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -295,7 +295,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
>  				ret = OOM_SCAN_CONTINUE;
>  			task_unlock(p);
> -		}
> +		} else if (task->state == EXIT_ZOMBIE)
> +			ret = OOM_SCAN_CONTINUE;

I think EXIT_ZOMBIE is too late, for it is the exit_notify() stage of
do_exit() that sets the EXIT_ZOMBIE state.

The stage my patch is trying to rescue is __mmput() from mmput() from
exit_mm() from do_exit(). It is something like doing

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f74..f1f892e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -289,11 +289,11 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
 		struct task_struct *p = find_lock_task_mm(task);
-		enum oom_scan_t ret = OOM_SCAN_ABORT;
+		enum oom_scan_t ret = OOM_SCAN_CONTINUE;
 
 		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
+			if (!test_bit(MMF_OOM_REAPED, &p->mm->flags))
+				ret = OOM_SCAN_ABORT;
 			task_unlock(p);
 		}
 
----------

which is effectively nearly equivalent to doing

----------
diff --git a/kernel/exit.c b/kernel/exit.c
index 84ae830..ea188a7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -509,9 +509,9 @@ static void exit_mm(struct task_struct *tsk)
 	enter_lazy_tlb(mm, current);
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
-	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
 		exit_oom_victim(tsk);
+	mmput(mm);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
----------

because we don't want to risk the OOM killer waiting forever when __mmput()
from mmput() from exit_mm() from do_exit() is blocked at a memory allocation.

If we had some timer or timeout, we could call

 	if (test_thread_flag(TIF_MEMDIE))
 		exit_oom_victim(tsk);

when __mmput() from mmput() from exit_mm() from do_exit() is blocked for
too long. But since we don't put a !can_oom_reap TIF_MEMDIE thread under
the OOM reaper's supervision, we need to do something equivalent to calling
exit_oom_victim(tsk) early, at the cost of increasing the possibility of
needlessly selecting the next OOM victim.

>  
>  		return ret;
>  	}
> @@ -592,14 +593,7 @@ static void oom_reap_task(struct task_struct *tsk)
>  		debug_show_all_locks();
>  	}
>  
> -	/*
> -	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> -	 * reasonably reclaimable memory anymore or it is not a good candidate
> -	 * for the oom victim right now because it cannot release its memory
> -	 * itself nor by the oom reaper.
> -	 */
>  	tsk->oom_reaper_list = NULL;
> -	exit_oom_victim(tsk);
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-24 11:02 [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE Tetsuo Handa
  2016-06-24 12:39 ` Michal Hocko
@ 2016-06-24 21:56 ` Oleg Nesterov
  2016-06-25  5:44   ` Tetsuo Handa
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-24 21:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, linux-mm, Michal Hocko, Vladimir Davydov, David Rientjes

Since I mentioned TIF_MEMDIE in another thread, I simply can't resist.
Sorry for grunting.

On 06/24, Tetsuo Handa wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -801,6 +801,7 @@ struct signal_struct {
>  	 * oom
>  	 */
>  	bool oom_flag_origin;
> +	bool oom_ignore_victims;        /* Ignore oom_victims value */
>  	short oom_score_adj;		/* OOM kill score adjustment */
>  	short oom_score_adj_min;	/* OOM kill score adjustment min value.
>  					 * Only settable by CAP_SYS_RESOURCE. */

Yet another kludge to fix yet another problem with TIF_MEMDIE. Not
to mention that that wh

Can't we state the fact TIF_MEMDIE is just broken? The very idea imo.
I am starting to seriously think we should kill this flag, fix the
compilation errors, remove the dead code (including the oom_victims
logic), and then try to add something else. Say, even MMF_MEMDIE looks
better although I understand it is not that simple.

Just one question. Why do we need this bit outside of oom_kill.c? It
affects page_alloc.c and this probably makes sense. But who gets this
flag when we decide to kill the memory hog? A random thread found by
find_lock_task_mm(), iow a random thread with ->mm != NULL, likely the
group leader. This simply cannot be right no matter what.



And in any case I don't understand this patch but I have to admit that
I failed to force myself to read the changelog and the actual change ;)
In any case I agree that we should not set MMF_MEMDIE if ->mm == NULL,
and if we ensure this then I do not understand why we can't rely on
MMF_OOM_REAPED. Ignoring the obvious races, if ->oom_victims != 0 then
find_lock_task_mm() should succeed.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-24 15:54   ` Tetsuo Handa
@ 2016-06-24 22:42     ` Oleg Nesterov
  0 siblings, 0 replies; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-24 22:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, vdavydov, rientjes

On 06/25, Tetsuo Handa wrote:
>
> Michal Hocko wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -295,7 +295,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> >  				ret = OOM_SCAN_CONTINUE;
> >  			task_unlock(p);
> > -		}
> > +		} else if (task->state == EXIT_ZOMBIE)
                                 ^^^^^

you meant exit_state ;)

> > +			ret = OOM_SCAN_CONTINUE;
>
> I think EXIT_ZOMBIE is too late, for it is exit_notify() stage from do_exit()
> which sets EXIT_ZOMBIE state.

Yes, and in any case nobody but exit/wait/ptrace code should ever look
at ->exit_state, not to mention EXIT_ZOMBIE/DEAD/WHATEVER codes.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-24 21:56 ` Oleg Nesterov
@ 2016-06-25  5:44   ` Tetsuo Handa
  2016-06-27  9:23     ` Michal Hocko
  2016-06-27 20:40     ` Oleg Nesterov
  0 siblings, 2 replies; 34+ messages in thread
From: Tetsuo Handa @ 2016-06-25  5:44 UTC (permalink / raw)
  To: oleg; +Cc: mhocko, linux-mm, mhocko, vdavydov, rientjes

Oleg Nesterov wrote:
> Since I mentioned TIF_MEMDIE in another thread, I simply can't resist.
> Sorry for grunting.
> 
> On 06/24, Tetsuo Handa wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -801,6 +801,7 @@ struct signal_struct {
> >  	 * oom
> >  	 */
> >  	bool oom_flag_origin;
> > +	bool oom_ignore_victims;        /* Ignore oom_victims value */
> >  	short oom_score_adj;		/* OOM kill score adjustment */
> >  	short oom_score_adj_min;	/* OOM kill score adjustment min value.
> >  					 * Only settable by CAP_SYS_RESOURCE. */
> 
> Yet another kludge to fix yet another problem with TIF_MEMDIE. Not
> to mention that that wh
> 
> Can't we state the fact TIF_MEMDIE is just broken? The very idea imo.

Yes. TIF_MEMDIE is a trouble maker.

Setting TIF_MEMDIE is per task_struct operation.
Sending SIGKILL is per signal_struct operation.
OOM killer is per mm_struct operation.

> I am starting to seriously think we should kill this flag, fix the
> compilation errors, remove the dead code (including the oom_victims
> logic), and then try to add something else. Say, even MMF_MEMDIE looks
> better although I understand it is not that simple.

I wish TIF_MEMDIE were a per-signal_struct flag. But since we allow a
mm-less TIF_MEMDIE thread to use ALLOC_NO_WATERMARKS via TIF_MEMDIE
inside __mmput() from mmput() from exit_mm() from do_exit(), we can't
replace

  test_thread_flag(TIF_MEMDIE)

in gfp_to_alloc_flags() with

  current->signal->oom_killed

or

  current->mm && (current->mm->flags & MMF_MEMDIE)

. But

> 
> Just one question. Why do we need this bit outside of oom-kill.c? It
> affects page_alloc.c and this probably makes sense. But who gets this
> flag when we decide to kill the memory hog? A random thread found by
> find_lock_task_mm(), iow a random thread with ->mm != NULL, likely the
> group leader. This simply can not be right no matter what.

I agree that setting TIF_MEMDIE on only the first ->mm != NULL thread
does not make sense.

I've proposed setting TIF_MEMDIE to all ->mm != NULL threads which are
killed by the OOM killer because doing so won't increase the risk of
depleting the memory reserves, for TIF_MEMDIE helps only if that thread is
doing memory allocation
( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ),
but it did not happen.

> 
> And in any case I don't understand this patch but I have to admit that
> I failed to force myself to read the changelog and the actual change ;)
> In any case I agree that we should not set MMF_MEMDIE if ->mm == NULL,
> and if we ensure this then I do not understand why we can't rely on
> MMF_OOM_REAPED. Ignoring the obvious races, if ->oom_victims != 0 then
> find_lock_task_mm() should succeed.

Since we are using

  mm = current->mm;
  current->mm = NULL;
  __mmput(mm); (may block for unbounded period waiting for somebody else's memory allocation)
  exit_oom_victim(current);

sequence, we won't be able to make find_lock_task_mm(tsk) != NULL when
tsk->signal->oom_victims != 0 unless we change this sequence.
My patch tries to rescue it using tsk->signal->oom_ignore_victims flag.

> 
> Oleg.
> 
> 


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-25  5:44   ` Tetsuo Handa
@ 2016-06-27  9:23     ` Michal Hocko
  2016-06-27 10:36       ` Michal Hocko
  2016-06-27 21:09       ` Oleg Nesterov
  2016-06-27 20:40     ` Oleg Nesterov
  1 sibling, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2016-06-27  9:23 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, linux-mm, vdavydov, rientjes

On Sat 25-06-16 14:44:39, Tetsuo Handa wrote:
> Oleg Nesterov wrote:
> > Since I mentioned TIF_MEMDIE in another thread, I simply can't resist.
> > Sorry for grunting.
> > 
> > On 06/24, Tetsuo Handa wrote:
> > >
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -801,6 +801,7 @@ struct signal_struct {
> > >  	 * oom
> > >  	 */
> > >  	bool oom_flag_origin;
> > > +	bool oom_ignore_victims;        /* Ignore oom_victims value */
> > >  	short oom_score_adj;		/* OOM kill score adjustment */
> > >  	short oom_score_adj_min;	/* OOM kill score adjustment min value.
> > >  					 * Only settable by CAP_SYS_RESOURCE. */
> > 
> > Yet another kludge to fix yet another problem with TIF_MEMDIE. Not
> > to mention that that wh
> > 
> > Can't we state the fact TIF_MEMDIE is just broken? The very idea imo.
> 
> Yes. TIF_MEMDIE is a trouble maker.
> 
> Setting TIF_MEMDIE is per task_struct operation.
> Sending SIGKILL is per signal_struct operation.
> OOM killer is per mm_struct operation.

Yes this is really unfortunate. I am trying to converge to per mm
behavior as much as possible. We are getting there slowly but not yet
there.

[...]
> > Just one question. Why do we need this bit outside of oom-kill.c? It
> > affects page_alloc.c and this probably makes sense. But who gets this
> > flag when we decide to kill the memory hog? A random thread found by
> > find_lock_task_mm(), iow a random thread with ->mm != NULL, likely the
> > group leader. This simply can not be right no matter what.
> 
> I agree that setting TIF_MEMDIE to only first ->mm != NULL thread
> does not make sense.

Well the idea was that other threads will get TIF_MEMDIE if they need to
allocate and the initial thread (usually the group leader) will hold off
any other oom killing until it gets past its mmput. So the flag acts
both as memory reserve access key and the exclusion. I am not sure
setting the flag to all threads in the same thread group would help all
that much. Processes sharing the mm outside of the thread group should
behave in a similar way. The general reluctance to give access to all
threads was to prevent a thundering herd effect, which is more likely
that way.

[...]

> > And in any case I don't understand this patch but I have to admit that
> > I failed to force myself to read the changelog and the actual change ;)
> > In any case I agree that we should not set MMF_MEMDIE if ->mm == NULL,
> > and if we ensure this then I do not understand why we can't rely on
> > MMF_OOM_REAPED. Ignoring the obvious races, if ->oom_victims != 0 then
> find_lock_task_mm() should succeed.
> 
> Since we are using
> 
>   mm = current->mm;
>   current->mm = NULL;
>   __mmput(mm); (may block for unbounded period waiting for somebody else's memory allocation)
>   exit_oom_victim(current);
> 
> sequence, we won't be able to make find_lock_task_mm(tsk) != NULL when
> tsk->signal->oom_victims != 0 unless we change this sequence.
> My patch tries to rescue it using tsk->signal->oom_ignore_victims flag.

I was thinking about this some more and I think that a better approach
would be to not forget the mm during the exit. The whole find_lock_task_mm
sounds more like a workaround than a real solution. I am trying to understand
why we really have to reset current->mm to NULL during the exit.
If we cannot change this then we can at least keep a stable mm
somewhere. The code would get so much easier that way.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27  9:23     ` Michal Hocko
@ 2016-06-27 10:36       ` Michal Hocko
  2016-06-27 15:51         ` Oleg Nesterov
  2016-06-27 21:09       ` Oleg Nesterov
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-27 10:36 UTC (permalink / raw)
  To: oleg, Tetsuo Handa; +Cc: linux-mm, vdavydov, rientjes

Oleg,

On Mon 27-06-16 11:23:26, Michal Hocko wrote:
[...]
> I was thinking about this some more and I think that a better approach
> would be to not forget the mm during the exit. The whole find_lock_task_mm
> sounds more like a workaround than a real solution. I am trying to understand
> why we really have to reset current->mm to NULL during the exit.
> If we cannot change this then we can at least keep a stable mm
> somewhere. The code would get so much easier that way.

I am trying to wrap my head around the active_mm semantics. It is not
reset on exit and it should refer to the same mm pointer for regular
processes AFAICS. [1] mentions that even regular processes can borrow an
anonymous address space. It used to be bdflush but this doesn't seem to
be the case anymore AFAICS.

Can we (ab)use it for places where we know we are dealing with OOM
victims (aka regular processes) and replace the find_lock_task_mm by
active_mm? I mean something like (not even compile tested so most
probably incomplete as well) the below?

We would have to drop the last reference to the mm later but that
shouldn't pin much memory. If this works as intended then we can
reliably use per-mm heuristics to break out of the oom lockup.

[1] https://www.kernel.org/doc/Documentation/vm/active_mm.txt
---
diff --git a/kernel/fork.c b/kernel/fork.c
index 452fc864f2f6..6198371573c7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -237,6 +237,8 @@ void free_task(struct task_struct *tsk)
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
 	arch_release_task_struct(tsk);
+	if (tsk->active_mm)
+		mmdrop(tsk->active_mm);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1022,6 +1024,8 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 good_mm:
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	/* to be released in the final free_task() */
+	atomic_inc(&mm->mm_count);
 	return 0;
 
 fail_nomem:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..18282423e7cb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -166,13 +166,14 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 {
 	long points;
 	long adj;
+	struct mm_struct *mm;
 
 	if (oom_unkillable_task(p, memcg, nodemask))
 		return 0;
 
+	task_lock(p);
 	p = find_lock_task_mm(p);
-	if (!p)
-		return 0;
+	mm = p->active_mm;
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
@@ -181,7 +182,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			test_bit(MMF_OOM_REAPED, &mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
@@ -191,8 +192,8 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
+	points = get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		atomic_long_read(&mm->nr_ptes) + mm_nr_pmds(mm);
 	task_unlock(p);
 
 	/*
@@ -288,14 +289,12 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
 		enum oom_scan_t ret = OOM_SCAN_ABORT;
 
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
+		task_lock(task);
+		if (test_bit(MMF_OOM_REAPED, &task->active_mm->flags))
+			ret = OOM_SCAN_CONTINUE;
+		task_unlock(task);
 
 		return ret;
 	}
@@ -367,32 +366,25 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
 	struct task_struct *p;
-	struct task_struct *task;
 
 	pr_info("[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name\n");
 	rcu_read_lock();
 	for_each_process(p) {
+		struct mm_struct *mm;
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
 
-		task = find_lock_task_mm(p);
-		if (!task) {
-			/*
-			 * This is a kthread or all of p's threads have already
-			 * detached their mm's.  There's no need to report
-			 * them; they can't be oom killed anyway.
-			 */
-			continue;
-		}
+		task_lock(p);
+		mm = p->mm;
 
 		pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",
-			task->pid, from_kuid(&init_user_ns, task_uid(task)),
-			task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
-			atomic_long_read(&task->mm->nr_ptes),
-			mm_nr_pmds(task->mm),
-			get_mm_counter(task->mm, MM_SWAPENTS),
-			task->signal->oom_score_adj, task->comm);
-		task_unlock(task);
+			p->pid, from_kuid(&init_user_ns, task_uid(p)),
+			p->tgid, mm->total_vm, get_mm_rss(mm),
+			atomic_long_read(&mm->nr_ptes),
+			mm_nr_pmds(mm),
+			get_mm_counter(mm, MM_SWAPENTS),
+			p->signal->oom_score_adj, p->comm);
+		task_unlock(p);
 	}
 	rcu_read_unlock();
 }
@@ -457,7 +449,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -484,12 +475,10 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * We might have race with exit path so consider our work done if there
 	 * is no mm.
 	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
+	task_lock(tsk);
+	mm = tsk->active_mm;
 	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	task_unlock(tsk);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -553,7 +542,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	mmput_async(mm);
 mm_drop:
 	mmdrop(mm);
-unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
 }
@@ -579,15 +567,13 @@ static void oom_reap_task(struct task_struct *tsk)
 		 * so hide the mm from the oom killer so that it can move on
 		 * to another task with a different mm struct.
 		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		task_lock(tsk);
+		if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &tsk->active_mm->flags)) {
+			pr_info("oom_reaper: giving up pid:%d (%s)\n",
+					task_pid_nr(tsk), tsk->comm);
+			set_bit(MMF_OOM_REAPED, &tsk->active_mm->flags);
 		}
+		task_unlock(tsk);
 
 		debug_show_all_locks();
 	}
@@ -879,18 +865,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	read_unlock(&tasklist_lock);
 
-	p = find_lock_task_mm(victim);
-	if (!p) {
-		put_task_struct(victim);
-		return;
-	} else if (victim != p) {
-		get_task_struct(p);
-		put_task_struct(victim);
-		victim = p;
-	}
-
+	task_lock(victim);
 	/* Get a reference to safely compare mm after task_unlock(victim) */
-	mm = victim->mm;
+	mm = victim->active_mm;
 	atomic_inc(&mm->mm_count);
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27 10:36       ` Michal Hocko
@ 2016-06-27 15:51         ` Oleg Nesterov
  2016-06-27 16:06           ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-27 15:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/27, Michal Hocko wrote:
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -237,6 +237,8 @@ void free_task(struct task_struct *tsk)
>  	ftrace_graph_exit_task(tsk);
>  	put_seccomp_filter(tsk);
>  	arch_release_task_struct(tsk);
> +	if (tsk->active_mm)
> +		mmdrop(tsk->active_mm);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -1022,6 +1024,8 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  good_mm:
>  	tsk->mm = mm;
>  	tsk->active_mm = mm;
> +	/* to be released in the final task_put */
> +	atomic_inc(&mm->mm_count);
>  	return 0;

No, I don't think this can work.

Note that tsk->active_mm in free_task() points to the random mm "borrowed"
from the previous/random task in context_switch() if task->mm == NULL. This
is true for kthreads and for the task which has already called exit_mm().

> -	p = find_lock_task_mm(tsk);
> -	if (!p)
> -		goto unlock_oom;
> -	mm = p->mm;
> +	task_lock(tsk);
> +	mm = tsk->active_mm;

The same. We can't know where this ->active_mm points to.

Just suppose that this tsk schedules after exit_mm(). When it gets CPU
again tsk->active_mm will point to ->mm of another task which in turns
called schedule() to make this tsk active.

Yes I agree, it would be nice to remove find_lock_task_mm(). And in
fact it would be nice to kill task_struct->mm (but this needs a lot
of cleanups). We probably want signal_struct->mm, but this is a bit
complicated (locking).

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27 15:51         ` Oleg Nesterov
@ 2016-06-27 16:06           ` Michal Hocko
  2016-06-27 17:55             ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-27 16:06 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Mon 27-06-16 17:51:20, Oleg Nesterov wrote:
> On 06/27, Michal Hocko wrote:
> >
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -237,6 +237,8 @@ void free_task(struct task_struct *tsk)
> >  	ftrace_graph_exit_task(tsk);
> >  	put_seccomp_filter(tsk);
> >  	arch_release_task_struct(tsk);
> > +	if (tsk->active_mm)
> > +		mmdrop(tsk->active_mm);
> >  	free_task_struct(tsk);
> >  }
> >  EXPORT_SYMBOL(free_task);
> > @@ -1022,6 +1024,8 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> >  good_mm:
> >  	tsk->mm = mm;
> >  	tsk->active_mm = mm;
> > +	/* to be released in the final task_put */
> > +	atomic_inc(&mm->mm_count);
> >  	return 0;
> 
> No, I don't think this can work.
> 
> Note that tsk->active_mm in free_task() points to the random mm "borrowed"
> from the previous/random task in context_switch() if task->mm == NULL. This
> is true for kthreads and for the task which has already called exit_mm().

OK, I misread the code. I thought we wouldn't pass that route
again. Anyway, back to the drawing board.

> 
> > -	p = find_lock_task_mm(tsk);
> > -	if (!p)
> > -		goto unlock_oom;
> > -	mm = p->mm;
> > +	task_lock(tsk);
> > +	mm = tsk->active_mm;
> 
> The same. We can't know where this ->active_mm points to.
> 
> Just suppose that this tsk schedules after exit_mm(). When it gets CPU
> again tsk->active_mm will point to ->mm of another task which in turns
> called schedule() to make this tsk active.
> 
> Yes I agree, it would be nice to remove find_lock_task_mm(). And in
> fact it would be nice to kill task_struct->mm (but this needs a lot
> of cleanups). We probably want signal_struct->mm, but this is a bit
> complicated (locking).

Is there any hard requirement to reset task_struct::mm in the first
place?

I mean I could have added an oom_mm pointer into the task_struct, and
that would guarantee that we always have a valid pointer when it is
needed, but at the cost of yet another mm pointer there.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27 16:06           ` Michal Hocko
@ 2016-06-27 17:55             ` Oleg Nesterov
  2016-06-28 10:19               ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-27 17:55 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/27, Michal Hocko wrote:
>
> On Mon 27-06-16 17:51:20, Oleg Nesterov wrote:
> >
> > Yes I agree, it would be nice to remove find_lock_task_mm(). And in
> > fact it would be nice to kill task_struct->mm (but this needs a lot
> > of cleanups). We probably want signal_struct->mm, but this is a bit
> > complicated (locking).
>
> Is there any hard requirement to reset task_struct::mm in the first
> place?

Well, at least the scheduler needs this. And we need to audit every
->mm != NULL check.

> I mean I could have added oom_mm pointer into the task_struct and that
> would guarantee that we always have a valid pointer when it is needed
> but having yet another mm pointer there.

and add another mmdrop(oom_mm) into free_task() ? This would be bad, we
do not want to delay __mmdrop()... Look, we even want to make the
free_thread_info() synchronous, so that we could free ->stack before the
final put_task_struct ;)

But could you remind why do you want this right now? I mean, the ability
to find ->mm with mm_count != 0 even if the user memory was already freed?

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-25  5:44   ` Tetsuo Handa
  2016-06-27  9:23     ` Michal Hocko
@ 2016-06-27 20:40     ` Oleg Nesterov
  2016-06-28 10:29       ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-27 20:40 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, mhocko, vdavydov, rientjes

On 06/25, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > And in any case I don't understand this patch but I have to admit that
> > I failed to force myself to read the changelog and the actual change ;)
> > In any case I agree that we should not set MMF_MEMDIE if ->mm == NULL,
> > and if we ensure this then I do not understand why we can't rely on
> > MMF_OOM_REAPED. Ignoring the obvious races, if ->oom_victims != 0 then
> find_lock_task_mm() should succeed.
>
> Since we are using
>
>   mm = current->mm;
>   current->mm = NULL;
>   __mmput(mm); (may block for unbounded period waiting for somebody else's memory allocation)
>   exit_oom_victim(current);
>
> sequence, we won't be able to make find_lock_task_mm(tsk) != NULL when
> tsk->signal->oom_victims != 0 unless we change this sequence.

Ah, but this is clear, note the "Ignoring the obvious races" above.
Can't we fix this race? I am a bit lost, but iirc we want this anyway
to ensure that we do not set TIF_MEMDIE if ->mm == NULL ?

Hmm. Although I am not sure I really understand the "may block for
unbounded period ..." above. Do you mean khugepaged_exit?

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27  9:23     ` Michal Hocko
  2016-06-27 10:36       ` Michal Hocko
@ 2016-06-27 21:09       ` Oleg Nesterov
  2016-06-28 10:26         ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-27 21:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/27, Michal Hocko wrote:
>
> Yes this is really unfortunate. I am trying to converge to per mm
> behavior as much as possible. We are getting there slowly but not yet
> there.

Yes, agreed, everything should be per-mm.

Say wake_oom_reaper/oom_reap_task. It is simply ugly that we pass a
task_struct to oom_reap_task(); it should work with an mm_struct. Again,
this is because of TIF_MEMDIE/exit_oom_victim. Except pr_info(), but this
is minor...

> So the flag acts
> both as memory reserve access key and the exclusion.

Yes, and this should be separated imo.

As for memory reserve access, I feel that we should only set this flag
if task == current... but this needs more discussion.

> I am not sure
> setting the flag to all threads in the same thread group would help all
> that much. Processes sharing the mm outside of the thread group should
> behave in a similar way. The general reluctance to give access to all
> threads was to prevent from thundering herd effect which is more likely
> that way.

Agreed, that is why I said it is not that simple.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27 17:55             ` Oleg Nesterov
@ 2016-06-28 10:19               ` Michal Hocko
  2016-06-29  0:13                 ` Oleg Nesterov
  2016-06-29 20:14                 ` Oleg Nesterov
  0 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2016-06-28 10:19 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Mon 27-06-16 19:55:55, Oleg Nesterov wrote:
> On 06/27, Michal Hocko wrote:
> >
> > On Mon 27-06-16 17:51:20, Oleg Nesterov wrote:
> > >
> > > Yes I agree, it would be nice to remove find_lock_task_mm(). And in
> > > fact it would be nice to kill task_struct->mm (but this needs a lot
> > > of cleanups). We probably want signal_struct->mm, but this is a bit
> > > complicated (locking).
> >
> > Is there any hard requirement to reset task_struct::mm in the first
> > place?
> 
> Well, at least the scheduler needs this.

Could you point me to where it depends on that? I mean if we are past
exit_mm then we have unmapped the address space most probably but why
should we care about that in the scheduler? There shouldn't be any
further access to the address space by that point. I can see that
context_switch() checks task->mm but it should just work when it sees it
non NULL, right?

> And we need to audit every ->mm != NULL check.

Yes I have started looking and some of them would indeed need to be
updated. get_task_mm users are easily fixable because we can do
mmget_not_zero. Some others check ->mm just to be sure to not touch
kernel threads.

Do you think this would be a way to go, though? We would have to special
case this because the mm_struct is quite large (~900B with my config) so
we would keep and pin it only for oom victims.

> > I mean I could have added oom_mm pointer into the task_struct and that

sorry, meant to say s@task_struct@signal_struct@

> > would guarantee that we always have a valid pointer when it is needed
> > but having yet another mm pointer there.
> 
> and add another mmdrop(oom_mm) into free_task() ?

Well, I would bind it to the signal_struct life cycle. See the diff
below.

> This would be bad, we
> do not want to delay __mmdrop()... Look, we even want to make the
> free_thread_info() synchronous, so that we could free ->stack before the
> final put_task_struct ;)

Hmm, it is true that the mm_struct is quite large, but it would be kept
only for oom-killed victims, and they should release some memory, so the
temporarily pinned mm shouldn't cause too much trouble.

> But could you remind why do you want this right now? I mean, the ability
> to find ->mm with mm_count != 0 even if the user memory was already freed?

I would like to drop exit_oom_victim() from oom_reap_task because that
causes other issues. It acts as a last resort to make sure that no
task will block the oom killer from selecting a new task for ever (see
oom_scan_process_thread) right now. That means I need to convey "skip
this task" somehow. mm_struct is ideal for that but we are losing it
during exit_mm while __mmput can block for an unbounded amount of time
and actually never reach exit_oom_victim right after that. I would like
to make oom_scan_process_thread robust enough not to care about an
unreachable exit_oom_victim as long as we know that the memory was freed
or at least attempted to do so. This will make the logic much more
simpler because we no longer have to think about the oom victim and its
state anymore and only rely on the oom reaping.

Does that make sense to you?

I haven't tested this at all so please take it only as a dump of my
current thinking.
---
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457ee3a8..10f6f42921f9 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,7 +70,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
+extern void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm);
 
 #ifdef CONFIG_MMU
 extern void wake_oom_reaper(struct task_struct *tsk);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d81a1eb974a..befdcc1cde3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,8 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
+					 * killed by the oom killer */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 452fc864f2f6..2bd3cc73d103 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -245,6 +245,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	if (sig->oom_mm)
+		mmdrop(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3ef4bb..e9fe52d95a15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1235,7 +1235,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
+		mark_oom_victim(current, current->mm);
 		wake_oom_reaper(current);
 		goto unlock;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..bf62c50f8c65 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves unless
 	 * the task has MMF_OOM_REAPED because chances that it would release
 	 * any memory is quite low.
+	 * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time
+	 * so let it try again.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
+		struct mm_struct *mm = task->signal->oom_mm;
 		enum oom_scan_t ret = OOM_SCAN_ABORT;
 
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
+		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+			ret = OOM_SCAN_CONTINUE;
+		else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags))
+			ret = OOM_SCAN_SELECT;
 
 		return ret;
 	}
@@ -457,7 +458,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -478,22 +478,10 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	mutex_lock(&oom_lock);
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
+	mm = tsk->signal->oom_mm;
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	/*
@@ -503,7 +491,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -551,8 +539,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
@@ -568,7 +554,7 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+		struct mm_struct *mm;
 
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
@@ -579,27 +565,17 @@ static void oom_reap_task(struct task_struct *tsk)
 		 * so hide the mm from the oom killer so that it can move on
 		 * to another task with a different mm struct.
 		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		mm = tsk->signal->oom_mm;
+		if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+			pr_info("oom_reaper: giving up pid:%d (%s)\n",
+					task_pid_nr(tsk), tsk->comm);
+			set_bit(MMF_OOM_REAPED, &mm->flags);
 		}
 
 		debug_show_all_locks();
 	}
 
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -661,17 +637,29 @@ subsys_initcall(oom_init)
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
+ * @mm: tsk's mm
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
+ *
+ * mm has to be non-NULL. We are not checking it in this function because
+ * races might have caused tsk->mm becoming NULL.
  */
-void mark_oom_victim(struct task_struct *tsk)
+void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
 {
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	atomic_inc(&tsk->signal->oom_victims);
+
+	/* oom_mm is bound to the signal struct life time */
+	if (!tsk->signal->oom_mm) {
+		atomic_inc(&mm->mm_count);
+		tsk->signal->oom_mm = mm;
+	}
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -828,7 +816,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t;
-	struct mm_struct *mm;
+	struct mm_struct *mm = READ_ONCE(p->mm);
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
@@ -838,8 +826,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
+	if (mm && task_will_free_mem(p)) {
+		mark_oom_victim(p, mm);
 		wake_oom_reaper(p);
 		put_task_struct(p);
 		return;
@@ -898,7 +886,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
+	mark_oom_victim(victim, mm);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -1019,7 +1007,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
 	if (current->mm && task_will_free_mem(current)) {
-		mark_oom_victim(current);
+		mark_oom_victim(current, current->mm);
 		wake_oom_reaper(current);
 		return true;
 	}
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27 21:09       ` Oleg Nesterov
@ 2016-06-28 10:26         ` Michal Hocko
  2016-06-29 19:34           ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-28 10:26 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Mon 27-06-16 23:09:04, Oleg Nesterov wrote:
> On 06/27, Michal Hocko wrote:
> >
> > Yes this is really unfortunate. I am trying to converge to per mm
> > behavior as much as possible. We are getting there slowly but not yet
> > there.
> 
> Yes, agreed, everything should be per-mm.
> 
> Say wake_oom_reaper/oom_reap_task. It is simply ugly we pass task_struct
> to oom_reap_task(), it should work with mm_struct. Again, this is because
> of TIF_MEMDIE/exit_oom_victim.  Except pr_info(), but this is minor...

I was also tempted to get back to the mm-based queuing but I think that
the pr_info is quite useful. Both the back off and the successful
reaping can tell us more about how all the machinery works.
 
> > So the flag acts
> > both as memory reserve access key and the exclusion.
> 
> Yes, and this should be separeted imo.

I would love to.

> As for memory reserve access, I feel that we should only set this flag
> if task == current... but this needs more discussion.

That would certainly be something to discuss. If we have other reliable
way to detect the oom victim and when it terminates then TIF_MEMDIE on
the current and only for memory reserves would be viable. Let's see
whether we can keep the killed mm around and use it as an indicator.
This would be a natural follow up cleanup.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-27 20:40     ` Oleg Nesterov
@ 2016-06-28 10:29       ` Michal Hocko
  2016-06-29 20:24         ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-28 10:29 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Mon 27-06-16 22:40:17, Oleg Nesterov wrote:
> On 06/25, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > And in any case I don't understand this patch but I have to admit that
> > > I failed to force myself to read the changelog and the actual change ;)
> > > In any case I agree that we should not set MMF_MEMDIE if ->mm == NULL,
> > > and if we ensure this then I do not understand why we can't rely on
> > > MMF_OOM_REAPED. Ignoring the obvious races, if ->oom_victims != 0 then
> > > find_lock_task_mm() should succeed.
> >
> > Since we are using
> >
> >   mm = current->mm;
> >   current->mm = NULL;
> >   __mmput(mm); (may block for unbounded period waiting for somebody else's memory allocation)
> >   exit_oom_victim(current);
> >
> > sequence, we won't be able to make find_lock_task_mm(tsk) != NULL when
> > tsk->signal->oom_victims != 0 unless we change this sequence.
> 
> Ah, but this is clear, note the "Ignoring the obvious races" above.
> Can't we fix this race? I am a bit lost, but iirc we want this anyway
> to ensure that we do not set TIF_MEMDIE if ->mm == NULL ?

This is not about a race it is about not reaching exit_oom_victim and
unblock the oom killer from selecting another victim.

> Hmm. Although I am not sure I really understand the "may block for
> unbounded period ..." above. Do you mean khugepaged_exit?

__mmput->exit_aio can wait for IO to complete and who knows what that
might depend on. Who knows how many others are lurking there.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-28 10:19               ` Michal Hocko
@ 2016-06-29  0:13                 ` Oleg Nesterov
  2016-06-29  8:33                   ` Michal Hocko
  2016-06-29 20:14                 ` Oleg Nesterov
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-29  0:13 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

Michal,

I am already sleeping, I'll try to reply to other parts of your email
(and other emails) tomorrow, just some notes about the patch you propose.

And cough sorry for noise... I personally hate-hate-hate every new "oom"
member you and Tetsuo add into task/signal_struct ;) But not in this case,
because I _think_ we need signal_struct->mm anyway in the long term.

So at first glance this patch makes sense, but unless I missed something
(the patch doesn't apply, so I can easily be wrong),

On 06/28, Michal Hocko wrote:
>
> @@ -245,6 +245,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
>  {
>  	taskstats_tgid_free(sig);
>  	sched_autogroup_exit(sig);
> +	if (sig->oom_mm)
> +		mmdrop(sig->oom_mm);
>  	kmem_cache_free(signal_cachep, sig);
>  }

OK, iiuc this is not that bad because only oom-killer can set it,

> +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	WARN_ON(oom_killer_disabled);
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> +
>  	atomic_inc(&tsk->signal->oom_victims);
> +
> +	/* oom_mm is bound to the signal struct life time */
> +	if (!tsk->signal->oom_mm) {
> +		atomic_inc(&mm->mm_count);
> +		tsk->signal->oom_mm = mm;

Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
makes sense.

> @@ -828,7 +816,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	struct task_struct *victim = p;
>  	struct task_struct *child;
>  	struct task_struct *t;
> -	struct mm_struct *mm;
> +	struct mm_struct *mm = READ_ONCE(p->mm);
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> @@ -838,8 +826,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (task_will_free_mem(p)) {
> -		mark_oom_victim(p);
> +	if (mm && task_will_free_mem(p)) {
> +		mark_oom_victim(p, mm);

And this looks really racy at first glance. Suppose that this memory hog execs
(this changes its ->mm) and then exits so that task_will_free_mem() == T, in
this case "mm" has nothing to do with tsk->mm and it can be already freed.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29  0:13                 ` Oleg Nesterov
@ 2016-06-29  8:33                   ` Michal Hocko
  2016-06-29 14:19                     ` Michal Hocko
  2016-06-29 20:01                     ` Oleg Nesterov
  0 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2016-06-29  8:33 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Wed 29-06-16 02:13:53, Oleg Nesterov wrote:
> Michal,
> 
> I am already sleeping, I'll try to reply to other parts of your email
> (and other emails) tomorrow, just some notes about the patch you propose.

Thanks!

> And cough sorry for noise... I personally hate-hate-hate every new "oom"
> member you and Tetsuo add into task/signal_struct ;)

I am not really happy about that either. I wish I could find a better
way...

> But not in this case, because I _think_ we need signal_struct->mm
> anyway in the long term.
> 
> So at first glance this patch makes sense, but unless I missed something
> (the patch doesn't apply I can be easily wrong),

This is on top of the current mmotm tree which contains other oom
changes.

[...]
> > +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
> >  {
> >  	WARN_ON(oom_killer_disabled);
> >  	/* OOM killer might race with memcg OOM */
> >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> >  		return;
> > +
> >  	atomic_inc(&tsk->signal->oom_victims);
> > +
> > +	/* oom_mm is bound to the signal struct life time */
> > +	if (!tsk->signal->oom_mm) {
> > +		atomic_inc(&mm->mm_count);
> > +		tsk->signal->oom_mm = mm;
> 
> Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
> makes sense.

mark_oom_victim will be called only for the current or under the
task_lock so it should be stable. Except for...

> > @@ -838,8 +826,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> >  	 */
> > -	if (task_will_free_mem(p)) {
> > -		mark_oom_victim(p);
> > +	if (mm && task_will_free_mem(p)) {
> > +		mark_oom_victim(p, mm);

This one. I didn't bother to cover it for the example patch but I have a
plan to address that. There are two possible ways. One is to pin
mm_count in oom_badness() so that we have a guarantee that it will not
get released from under us and the other one is to make
task_will_free_mem task_lock friendly and call this under the lock as we
used to.
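As an illustration of the second option (pin mm_count while ->mm is still stable under task_lock), here is a hedged userspace sketch; the struct layout, lock, and function names are stand-ins, not the kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal stand-ins: only the pin count and the task's mm pointer matter. */
struct mm { int mm_count; };

struct task {
	pthread_mutex_t alloc_lock;	/* stands in for task_lock() */
	struct mm *mm;			/* cleared/replaced under alloc_lock */
};

/*
 * Read tsk->mm and pin it while still holding the lock, so the pointer
 * cannot be freed (or swapped by exec) between the read and the pin.
 * The caller is responsible for dropping the pin later (mmdrop-style).
 */
static struct mm *get_task_mm_pinned(struct task *tsk)
{
	struct mm *mm;

	pthread_mutex_lock(&tsk->alloc_lock);
	mm = tsk->mm;
	if (mm)
		mm->mm_count++;		/* pin before the lock is dropped */
	pthread_mutex_unlock(&tsk->alloc_lock);
	return mm;
}
```

The point of the sketch is only the ordering: the increment happens inside the same critical section that guarantees ->mm is not changing, which is what "task_will_free_mem task_lock friendly" would buy.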
 
> And this looks really racy at first glance. Suppose that this memory hog execs
> (this changes its ->mm) and then exits so that task_will_free_mem() == T, in
> this case "mm" has nothing to do with tsk->mm and it can be already freed.

Hmm, I didn't think about exec case. And I guess we have never cared
about that race. We just select a task and then kill it. The fact that
it is not sitting on the same memory anymore is silently ignored... But
I have to think about it more. I would be more worried about the use
after free.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29  8:33                   ` Michal Hocko
@ 2016-06-29 14:19                     ` Michal Hocko
  2016-07-01 10:15                       ` Tetsuo Handa
  2016-06-29 20:01                     ` Oleg Nesterov
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-29 14:19 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

OK, so this is what I have on top of [1]. I still haven't tested it but
at least added a changelog to have some reasoning for the change. If
this looks OK and passes my testing - which would be tricky anyway
because hitting those rare corner cases is quite hard - then the next
step would be to fix the race between suspend and oom_killer_disable
currently worked around by 74070542099c in a more robust way. We can
also start thinking to use TIF_MEMDIE only for the access to memory
reserves to oom victims which actually need to allocate and decouple the
current double meaning.

I completely understand a resistance to adding new stuff to the
signal_struct but this seems like worth it. I would like to have a
stable and existing mm for that purpose but that sounds like a more long
term plan than something we can do right away.

Thoughts?

[1] http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
---


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-28 10:26         ` Michal Hocko
@ 2016-06-29 19:34           ` Oleg Nesterov
  0 siblings, 0 replies; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-29 19:34 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/28, Michal Hocko wrote:
>
> On Mon 27-06-16 23:09:04, Oleg Nesterov wrote:
> > On 06/27, Michal Hocko wrote:
> > >
> > > Yes this is really unfortunate. I am trying to converge to per mm
> > > behavior as much as possible. We are getting there slowly but not yet
> > > there.
> >
> > Yes, agreed, everything should be per-mm.
> >
> > Say wake_oom_reaper/oom_reap_task. It is simply ugly we pass task_struct
> > to oom_reap_task(), it should work with mm_struct. Again, this is because
> > of TIF_MEMDIE/exit_oom_victim.  Except pr_info(), but this is minor...
>
I was also tempted to get back to the mm based queueing but I think that
the pr_info is quite useful.

It is, I agree. But this is solveable, I think. If nothing else, we can even
do another for_each_thread() loop and report all tasks which use this mm, or
we can pass a pid/mm tuple. Let's discuss this later, this is not that important.

> > As for memory reserve access, I feel that we should only set this flag
> > if task == current... but this needs more discussion.
>
> That would certainly be something to discuss. If we have other reliable
> way to detect the oom victim and when it terminates then TIF_MEMDIE on
> the current and only for memory reserves would be viable. Let's see
> whether we can keep the killed mm around and use it as an indicator.
> This would be a natural follow up cleanup.

Agreed, this looks certainly better than what we have now. Although I am
not sure I fully understand the details, but it seems that everything would
be better anyway ;)

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29  8:33                   ` Michal Hocko
  2016-06-29 14:19                     ` Michal Hocko
@ 2016-06-29 20:01                     ` Oleg Nesterov
  2016-06-30  7:59                       ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-29 20:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/29, Michal Hocko wrote:
>
> > > +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
> > >  {
> > >  	WARN_ON(oom_killer_disabled);
> > >  	/* OOM killer might race with memcg OOM */
> > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > >  		return;
> > > +
> > >  	atomic_inc(&tsk->signal->oom_victims);
> > > +
> > > +	/* oom_mm is bound to the signal struct life time */
> > > +	if (!tsk->signal->oom_mm) {
> > > +		atomic_inc(&mm->mm_count);
> > > +		tsk->signal->oom_mm = mm;
> >
> > Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
> > makes sense.
>
> mark_oom_victim will be called only for the current or under the
> task_lock so it should be stable. Except for...

I meant that the code looks racy because 2 threads can see ->oom_mm == NULL
at the same time and in this case we have the extra atomic_inc(mm_count).
But I guess oom_lock saves us, so the code is correct but not clear.

> > > @@ -838,8 +826,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> > >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> > >  	 */
> > > -	if (task_will_free_mem(p)) {
> > > -		mark_oom_victim(p);
> > > +	if (mm && task_will_free_mem(p)) {
> > > +		mark_oom_victim(p, mm);
>
> This one. I didn't bother to cover it for the example patch but I have a
> plan to address that. There are two possible ways. One is to pin
> mm_count in oom_badness() so that we have a guarantee that it will not

I thought about this too. And I think that select_bad_process() should even
return mm_struct or at least a task_lock'ed task for the start.

> > And this looks really racy at first glance. Suppose that this memory hog execs
> > (this changes its ->mm) and then exits so that task_will_free_mem() == T, in
> > this case "mm" has nothing to do with tsk->mm and it can be already freed.
>
> Hmm, I didn't think about exec case. And I guess we have never cared
> about that race. We just select a task and then kill it.

And I guess we want to fix this too, although this is not that important,
but this looks like a minor security problem.

And this is another indication that almost everything oom-kill.c does with
task_struct is wrong ;) Ideally it should only use task_struct to send the
SIGKILL, and now that we kill all users of victim->mm we can hopefully do
this later.

Btw, do we still need this list_for_each_entry(child, &t->children, sibling)
loop in oom_kill_process() ?

> I would be more worried about the use
> after free.

Yes, yes, this is what I meant.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-28 10:19               ` Michal Hocko
  2016-06-29  0:13                 ` Oleg Nesterov
@ 2016-06-29 20:14                 ` Oleg Nesterov
  2016-06-30  8:07                   ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-29 20:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/28, Michal Hocko wrote:
>
> On Mon 27-06-16 19:55:55, Oleg Nesterov wrote:
> > On 06/27, Michal Hocko wrote:
> > >
> > > On Mon 27-06-16 17:51:20, Oleg Nesterov wrote:
> > > >
> > > > Yes I agree, it would be nice to remove find_lock_task_mm(). And in
> > > > fact it would be nice to kill task_struct->mm (but this needs a lot
> > > > of cleanups). We probably want signal_struct->mm, but this is a bit
> > > > complicated (locking).
> > >
> > > Is there any hard requirement to reset task_struct::mm in the first
> > > place?
> >
> > Well, at least the scheduler needs this.
>
> Could you point me to where it depends on that? I mean if we are past
> exit_mm then we have unmapped the address space most probably but why
> should we care about that in the scheduler? There shouldn't be any
> further access to the address space by that point. I can see that
> context_switch() checks task->mm but it should just work when it sees it
> non NULL, right?

But who will do the final mmdrop() then? I am not saying this is impossible
to change, say we do this in finish_task_switch(TASK_DEAD) or even in
free_task(), but we do not want this?

> Do you think this would be a way to go, though? We would have to special
> case this because the mm_struct is quite large (~900B with my config) so
> we would keep and pin it only for oom victims.

Plus page tables, so it is more than 900B. But as I said, personally I agree
with signal->oom_mm which can only be set by oom-killer.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-28 10:29       ` Michal Hocko
@ 2016-06-29 20:24         ` Oleg Nesterov
  2016-06-30  8:16           ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-06-29 20:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/28, Michal Hocko wrote:
>
> On Mon 27-06-16 22:40:17, Oleg Nesterov wrote:
> >
> > Ah, but this is clear, note the "Ignoring the obvious races" above.
> > Can't we fix this race? I am a bit lost, but iirc we want this anyway
> > to ensure that we do not set TIF_MEMDIE if ->mm == NULL ?
>
> This is not about a race it is about not reaching exit_oom_victim and
> unblock the oom killer from selecting another victim.

I understand. What I do not understand why we can't rely on MMF_OOM_REAPED
if we ensure that TIF_MEMDIE can only be set if the victim did not call
exit_oom_victim() yet.

OK, please forget, I already got lost and right now I don't even have the
uptodate -mm tree sources.

> > Hmm. Although I am not sure I really understand the "may block for
> > unbounded period ..." above. Do you mean khugepaged_exit?
>
> __mmput->exit_aio can wait for IO to complete and who knows what that
> might depend on.

Yes, but I was confused by "waiting for somebody else's memory allocation",
I do not think this applies to exit_aio.

Nevermind,

> Who knows how many others are lurking there.

Yes, yes, I agree. I just wrongly thought Tetsuo meant something specific.

Oleg.


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29 20:01                     ` Oleg Nesterov
@ 2016-06-30  7:59                       ` Michal Hocko
  2016-06-30 10:51                         ` Tetsuo Handa
  2016-07-03 13:21                         ` Oleg Nesterov
  0 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2016-06-30  7:59 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Wed 29-06-16 22:01:08, Oleg Nesterov wrote:
> On 06/29, Michal Hocko wrote:
> >
> > > > +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
> > > >  {
> > > >  	WARN_ON(oom_killer_disabled);
> > > >  	/* OOM killer might race with memcg OOM */
> > > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > >  		return;
> > > > +
> > > >  	atomic_inc(&tsk->signal->oom_victims);
> > > > +
> > > > +	/* oom_mm is bound to the signal struct life time */
> > > > +	if (!tsk->signal->oom_mm) {
> > > > +		atomic_inc(&mm->mm_count);
> > > > +		tsk->signal->oom_mm = mm;
> > >
> > > Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
> > > makes sense.
> >
> > mark_oom_victim will be called only for the current or under the
> > task_lock so it should be stable. Except for...
> 
> I meant that the code looks racy because 2 threads can see ->oom_mm == NULL
> at the same time and in this case we have the extra atomic_inc(mm_count).
> But I guess oom_lock saves us, so the code is correct but not clear.

I have changed that to cmpxchg because lowmemory killer is called
outside of oom_lock.
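As a userspace C11 sketch of why cmpxchg closes the race Oleg describes (two threads both seeing ->oom_mm == NULL and both taking a reference) — the names here are stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: only the pin count matters here. */
struct mm { atomic_int mm_count; };

/*
 * One-shot publication of @mm into @slot. The compare-exchange makes the
 * "is it still NULL?" test and the store a single atomic step, so two
 * racing callers can no longer both see NULL and both leave a reference
 * behind: the loser detects the failed CAS and drops its speculative pin.
 */
static int publish_oom_mm(_Atomic(struct mm *) *slot, struct mm *mm)
{
	struct mm *expected = NULL;

	atomic_fetch_add(&mm->mm_count, 1);	/* speculative pin */
	if (atomic_compare_exchange_strong(slot, &expected, mm))
		return 1;			/* we published it */
	atomic_fetch_sub(&mm->mm_count, 1);	/* lost the race: unpin */
	return 0;
}
```

With a plain "if (!slot) { inc; store; }" sequence, both racers could pass the NULL test first and the slot would end up with an extra reference that nobody ever drops.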

> > > > @@ -838,8 +826,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > > >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> > > >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> > > >  	 */
> > > > -	if (task_will_free_mem(p)) {
> > > > -		mark_oom_victim(p);
> > > > +	if (mm && task_will_free_mem(p)) {
> > > > +		mark_oom_victim(p, mm);
> >
> > This one. I didn't bother to cover it for the example patch but I have a
> > plan to address that. There are two possible ways. One is to pin
> > mm_count in oom_badness() so that we have a guarantee that it will not
> 
> I thought about this too. And I think that select_bad_process() should even
> return mm_struct or at least a task_lock'ed task for the start.

Yes that would be a plan if I pinned the mm struct in oom_badness. I
ended up using task_lock around task_will_free_mem so it should be good
for now. Let's see whether we can be more clever about that later.

> > > And this looks really racy at first glance. Suppose that this memory hog execs
> > > (this changes its ->mm) and then exits so that task_will_free_mem() == T, in
> > > this case "mm" has nothing to do with tsk->mm and it can be already freed.
> >
> > Hmm, I didn't think about exec case. And I guess we have never cared
> > about that race. We just select a task and then kill it.
> 
> And I guess we want to fix this too, although this is not that important,
> but this looks like a minor security problem.

I am not sure I can see security implications but I agree this is less
than optimal, albeit not critical. Killing a young process which didn't
have much time to do useful work doesn't seem that harmful. It would
be much better to kill the real holder of the mm though!

> And this is another indication that almost everything oom-kill.c does with
> task_struct is wrong ;) Ideally It should only use task_struct to send the
> SIGKILL, and now that we kill all users of victim->mm we can hopefully do
> this later.

Hmm, so you think we should do s@victim@mm_victim@ and then do the
for_each_process loop to find all the tasks sharing that mm and kill
them? We are doing that already so it doesn't sound that bad...

> Btw, do we still need this list_for_each_entry(child, &t->children, sibling)
> loop in oom_kill_process() ?

Well, to be honest, I don't know. This is a heuristic we have been doing
for a long time. I do not know how many times it really matters. It can
even be harmful in loads where children are created in the same pace OOM
killer is killing them. Not sure how likely is that though...
Let me think whether we can do something about that.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29 20:14                 ` Oleg Nesterov
@ 2016-06-30  8:07                   ` Michal Hocko
  2016-07-03 13:24                     ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-06-30  8:07 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Wed 29-06-16 22:14:09, Oleg Nesterov wrote:
> On 06/28, Michal Hocko wrote:
> >
> > On Mon 27-06-16 19:55:55, Oleg Nesterov wrote:
> > > On 06/27, Michal Hocko wrote:
> > > >
> > > > On Mon 27-06-16 17:51:20, Oleg Nesterov wrote:
> > > > >
> > > > > Yes I agree, it would be nice to remove find_lock_task_mm(). And in
> > > > > fact it would be nice to kill task_struct->mm (but this needs a lot
> > > > > of cleanups). We probably want signal_struct->mm, but this is a bit
> > > > > complicated (locking).
> > > >
> > > > Is there any hard requirement to reset task_struct::mm in the first
> > > > place?
> > >
> > > Well, at least the scheduler needs this.
> >
> > Could you point me to where it depends on that? I mean if we are past
> > exit_mm then we have unmapped the address space most probably but why
> > should we care about that in the scheduler? There shouldn't be any
> > further access to the address space by that point. I can see that
> > context_switch() checks task->mm but it should just work when it sees it
> > non NULL, right?
> 
> But who will do the final mmdrop() then? I am not saying this is impossible
> to change, say we do this in finish_task_switch(TASK_DEAD) or even in
> free_task(), but we do not want this?

I thought it could be done somewhere in release_task after we unhash
the process but then we would need something for the exclusion (possibly
task_lock) to handle races when the oom killer sees a task while it is
being unhashed. I guess it should be doable...
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29 20:24         ` Oleg Nesterov
@ 2016-06-30  8:16           ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2016-06-30  8:16 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Wed 29-06-16 22:24:24, Oleg Nesterov wrote:
> On 06/28, Michal Hocko wrote:
> >
> > On Mon 27-06-16 22:40:17, Oleg Nesterov wrote:
> > >
> > > Ah, but this is clear, note the "Ignoring the obvious races" above.
> > > Can't we fix this race? I am a bit lost, but iirc we want this anyway
> > > to ensure that we do not set TIF_MEMDIE if ->mm == NULL ?
> >
> > This is not about a race it is about not reaching exit_oom_victim and
> > unblock the oom killer from selecting another victim.
> 
> I understand. What I do not understand why we can't rely on MMF_OOM_REAPED
> if we ensure that TIF_MEMDIE can only be set if the victim did not call
> exit_oom_victim() yet.
> 
> OK, please forget, I already got lost and right now I don't even have the
> uptodate -mm tree sources.
> 
> > > Hmm. Although I am not sure I really understand the "may block for
> > > unbounded period ..." above. Do you mean khugepaged_exit?
> >
> > __mmput->exit_aio can wait for IO to complete and who knows what that
> > might depend on.
> 
> Yes, but I was confused by "waiting for somebody else's memory allocation",
> I do not this this apllies to exit_aio.

To be honest I really don't know. I am just assuming the worst. And IO
sometimes need to allocate to move on.

> Nevermind,
> 
> > Who knows how many others are lurking there.
> 
> Yes, yes, I agree. Just I wrongly thought Tetsuo meant something particular.

I guess we just want to be conservative here and make sure we do not
depend on the particular implementation details.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-30  7:59                       ` Michal Hocko
@ 2016-06-30 10:51                         ` Tetsuo Handa
  2016-06-30 11:21                           ` Michal Hocko
  2016-07-03 13:32                           ` Oleg Nesterov
  2016-07-03 13:21                         ` Oleg Nesterov
  1 sibling, 2 replies; 34+ messages in thread
From: Tetsuo Handa @ 2016-06-30 10:51 UTC (permalink / raw)
  To: mhocko, oleg; +Cc: linux-mm, vdavydov, rientjes

Michal Hocko wrote:
> On Wed 29-06-16 22:01:08, Oleg Nesterov wrote:
> > On 06/29, Michal Hocko wrote:
> > >
> > > > > +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
> > > > >  {
> > > > >  	WARN_ON(oom_killer_disabled);
> > > > >  	/* OOM killer might race with memcg OOM */
> > > > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > > >  		return;
> > > > > +
> > > > >  	atomic_inc(&tsk->signal->oom_victims);
> > > > > +
> > > > > +	/* oom_mm is bound to the signal struct life time */
> > > > > +	if (!tsk->signal->oom_mm) {
> > > > > +		atomic_inc(&mm->mm_count);
> > > > > +		tsk->signal->oom_mm = mm;
> > > >
> > > > Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
> > > > makes sense.
> > >
> > > mark_oom_victim will be called only for the current or under the
> > > task_lock so it should be stable. Except for...
> > 
> > I meant that the code looks racy because 2 threads can see ->oom_mm == NULL
> > at the same time and in this case we have the extra atomic_inc(mm_count).
> > But I guess oom_lock saves us, so the code is correct but not clear.
> 
> I have changed that to cmpxchg because lowmemory killer is called
> outside of oom_lock.

Android's lowmemory killer is no longer using mark_oom_victim().

> > Btw, do we still need this list_for_each_entry(child, &t->children, sibling)
> > loop in oom_kill_process() ?
> 
> Well, to be honest, I don't know. This is a heuristic we have been doing
> for a long time. I do not know how many times it really matters. It can
> even be harmful in loads where children are created in the same pace OOM
> killer is killing them. Not sure how likely is that though...
> Let me think whether we can do something about that.

I'm using that behavior in order to test almost-OOM situations. ;)



By the way, are you going to fix use_mm() race? Currently, we don't wake up
OOM reaper if some kernel thread is holding a reference to that mm via
use_mm(). But currently we can hit

  (1) OOM killer fails to find use_mm() users using for_each_process() in
      oom_kill_process() and wakes up OOM reaper.

  (2) Some kernel thread calls use_mm().

  (3) OOM reaper ignores use_mm() users and reaps that mm.

race. I think we need to make use_mm() fail after mark_oom_victim() is called.
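A hedged sketch of what "make use_mm() fail after mark_oom_victim()" could mean, in userspace C11 — the flag placement and names are assumptions for illustration, not the kernel interface:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for mm_struct: a user count plus an oom-victim flag. */
struct mm {
	atomic_int mm_users;
	atomic_bool oom_victim;	/* set by a mark_oom_victim()-like path */
};

/*
 * Failable use_mm(): refuse to take a new user reference once the mm has
 * been marked as an OOM victim, so a kthread cannot grab the mm in the
 * window between the OOM killer's for_each_process() scan and the reaper
 * tearing the address space down. The second check after the increment
 * narrows the mark-vs-increment race; a full fix would also need matching
 * ordering on the marking side.
 */
static bool try_use_mm(struct mm *mm)
{
	if (atomic_load(&mm->oom_victim))
		return false;		/* reaper may already be shredding it */
	atomic_fetch_add(&mm->mm_users, 1);
	if (atomic_load(&mm->oom_victim)) {
		atomic_fetch_sub(&mm->mm_users, 1);
		return false;		/* marked while we were acquiring */
	}
	return true;
}
```

Callers that today assume use_mm() cannot fail (vhost, aio completion paths) would have to grow an error path, which is presumably the intrusive part of such a change.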


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-30 10:51                         ` Tetsuo Handa
@ 2016-06-30 11:21                           ` Michal Hocko
  2016-07-03 13:32                           ` Oleg Nesterov
  1 sibling, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2016-06-30 11:21 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, linux-mm, vdavydov, rientjes

On Thu 30-06-16 19:51:53, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > I have changed that to cmpxchg because lowmemory killer is called
> > outside of oom_lock.
> 
> Android's lowmemory killer is no longer using mark_oom_victim().

You are right! The mmotm tree doesn't have the patch because it was
routed via Greg. I will probably keep the cmpxchg, though, because it
is less error prone and doesn't add much...
 
[...]

> By the way, are you going to fix use_mm() race? Currently, we don't wake up
> OOM reaper if some kernel thread is holding a reference to that mm via
> use_mm(). But currently we can hit
> 
>   (1) OOM killer fails to find use_mm() users using for_each_process() in
>       oom_kill_process() and wakes up OOM reaper.
> 
>   (2) Some kernel thread calls use_mm().
> 
>   (3) OOM reaper ignores use_mm() users and reaps that mm.
> 
> race. I think we need to make use_mm() fail after mark_oom_victim() is called.

Considering it would matter only for vhost, which I would like to make
oom-reaper safe anyway, I would prefer to do the latter and not treat
kthreads specially.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-29 14:19                     ` Michal Hocko
@ 2016-07-01 10:15                       ` Tetsuo Handa
  0 siblings, 0 replies; 34+ messages in thread
From: Tetsuo Handa @ 2016-07-01 10:15 UTC (permalink / raw)
  To: mhocko, oleg; +Cc: linux-mm, vdavydov, rientjes

Michal Hocko wrote:
> OK, so this is what I have on top of [1]. I still haven't tested it but
> at least added a changelog to have some reasoning for the change. If
> this looks OK and passes my testing - which would be tricky anyway
> because hitting those rare corner cases is quite hard - then the next
> step would be to fix the race between suspend and oom_killer_disable
> currently worked around by 74070542099c in a more robust way. We can
> also start thinking to use TIF_MEMDIE only for the access to memory
> reserves to oom victims which actually need to allocate and decouple the
> current double meaning.
> 
> I completely understand a resistance to adding new stuff to the
> signal_struct but this seems like worth it. I would like to have a
> stable and existing mm for that purpose but that sounds like a more long
> term plan than something we can do right away.
> 
> Thoughts?

If adding new stuff to the mm_struct is acceptable (like the patch shown
below), we can remove

  (1) oom_victims from the signal_struct
  (2) oom_reaper_list from the task_struct
  (3) exit_oom_victim() on remote tasks
  (4) commit 74070542099c66d8 ("oom, suspend: fix oom_reaper vs.
      oom_killer_disable race") due to (3)
  (5) OOM_SCAN_ABORT case from oom_scan_process_thread() because
      oom_has_pending_mm() is called before starting victim selection loop
  (6) commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
      because oom_has_pending_mm() remains true until exit_oom_mm() is called
      from __mmput() or oom_reaper()

and make the OOM reaper operate on mm_struct.

Also, after the patch shown below is applied, we can remove the almost
pointless oom_unkillable_task() test in oom_scan_process_thread(), which is
needed only for the oom_task_origin() test; that test can instead be done
after the oom_unkillable_task() test in oom_badness() (i.e. we can remove
oom_scan_process_thread() entirely).

 include/linux/mm_types.h |  10 ++
 include/linux/oom.h      |  16 +--
 include/linux/sched.h    |   4 -
 kernel/exit.c            |   2 +-
 kernel/fork.c            |   1 +
 mm/memcontrol.c          |  18 ++--
 mm/oom_kill.c            | 262 +++++++++++++++++++++--------------------------
 7 files changed, 142 insertions(+), 171 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d..84b74ad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/nodemask.h> /* nodemask_t */
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -392,6 +393,14 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+struct oom_mm {
+	struct list_head list; /* Linked to oom_mm_list list. */
+	struct mem_cgroup *memcg; /* No deref. Maybe NULL. */
+	const nodemask_t *nodemask; /* No deref. Maybe NULL. */
+	char comm[16]; /* Copy of task_struct->comm[TASK_COMM_LEN]. */
+	pid_t pid; /* Copy of task_struct->pid. */
+};
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -518,6 +527,7 @@ struct mm_struct {
 #ifdef CONFIG_MMU
 	struct work_struct async_put_work;
 #endif
+	struct oom_mm oom_mm;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457..1a212c1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,7 +49,6 @@ enum oom_constraint {
 enum oom_scan_t {
 	OOM_SCAN_OK,		/* scan thread and find its badness */
 	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
-	OOM_SCAN_ABORT,		/* abort the iteration and return */
 	OOM_SCAN_SELECT,	/* always select this thread first */
 };
 
@@ -70,15 +69,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
-#ifdef CONFIG_MMU
-extern void wake_oom_reaper(struct task_struct *tsk);
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif
+extern void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -91,12 +82,15 @@ extern void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint);
 
+extern void exit_oom_mm(struct mm_struct *mm);
+extern bool oom_has_pending_mm(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					       struct task_struct *task);
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..4379279 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -671,7 +671,6 @@ struct signal_struct {
 	atomic_t		sigcnt;
 	atomic_t		live;
 	int			nr_threads;
-	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
 	struct list_head	thread_head;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
@@ -1917,9 +1916,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 84ae830..1b1dada 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/fork.c b/kernel/fork.c
index 7926993..b870dbc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3..bf4c8f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1235,12 +1235,18 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		mark_oom_victim(current, &oc);
 		goto unlock;
 	}
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG);
+
+	if (oom_has_pending_mm(memcg, NULL)) {
+		/* Set a dummy value to return "true". */
+		chosen = (void *) 1;
+		goto unlock;
+	}
+
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
@@ -1258,14 +1264,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				/* fall through */
 			case OOM_SCAN_CONTINUE:
 				continue;
-			case OOM_SCAN_ABORT:
-				css_task_iter_end(&it);
-				mem_cgroup_iter_break(memcg, iter);
-				if (chosen)
-					put_task_struct(chosen);
-				/* Set a dummy value to return "true". */
-				chosen = (void *) 1;
-				goto unlock;
 			case OOM_SCAN_OK:
 				break;
 			};
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275..c9e4b53 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -275,6 +275,50 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+static LIST_HEAD(oom_mm_list);
+static DEFINE_SPINLOCK(oom_mm_list_lock);
+
+/*
+ * This function might be called concurrently from both exiting thread
+ * and the OOM reaper thread.
+ */
+void exit_oom_mm(struct mm_struct *mm)
+{
+	bool remove;
+
+	if (!mm->oom_mm.list.next)
+		return;
+	/* Disconnect from oom_mm_list only if still connected. */
+	spin_lock(&oom_mm_list_lock);
+	remove = mm->oom_mm.list.next;
+	if (remove) {
+		list_del(&mm->oom_mm.list);
+		mm->oom_mm.list.next = NULL;
+	}
+	spin_unlock(&oom_mm_list_lock);
+	/* Drop a reference taken by mark_oom_victim() */
+	if (remove)
+		mmdrop(mm);
+}
+
+bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+	struct oom_mm *mm;
+	bool ret = false;
+
+	spin_lock(&oom_mm_list_lock);
+	list_for_each_entry(mm, &oom_mm_list, list) {
+		if (memcg && mm->memcg != memcg)
+			continue;
+		if (nodemask && mm->nodemask != nodemask)
+			continue;
+		ret = true;
+		break;
+	}
+	spin_unlock(&oom_mm_list_lock);
+	return ret;
+}
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					struct task_struct *task)
 {
@@ -282,25 +326,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		return OOM_SCAN_CONTINUE;
 
 	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
-		enum oom_scan_t ret = OOM_SCAN_ABORT;
-
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
-
-		return ret;
-	}
-
-	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
@@ -332,9 +357,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			/* fall through */
 		case OOM_SCAN_CONTINUE:
 			continue;
-		case OOM_SCAN_ABORT:
-			rcu_read_unlock();
-			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
 		};
@@ -447,54 +469,17 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
  */
-static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_vma(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto mm_drop;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * increase mm_users only after we know we will reap something so
@@ -503,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		return true;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -534,7 +519,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
+			mm->oom_mm.pid, mm->oom_mm.comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
@@ -551,27 +536,32 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_vma(struct mm_struct *mm)
 {
 	int attempts = 0;
+	bool ret;
+
+	/*
+	 * Check MMF_OOM_REAPED after holding oom_lock because
+	 * oom_kill_process() might find this mm pinned.
+	 */
+	mutex_lock(&oom_lock);
+	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
+	mutex_unlock(&oom_lock);
+	if (ret)
+		return;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_vma(mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
-
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+			mm->oom_mm.pid, mm->oom_mm.comm);
 
 		/*
 		 * If we've already tried to reap this task in the past and
@@ -579,30 +569,14 @@ static void oom_reap_task(struct task_struct *tsk)
 		 * so hide the mm from the oom killer so that it can move on
 		 * to another task with a different mm struct.
 		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+			pr_info("oom_reaper: giving up pid:%d (%s)\n",
+				mm->oom_mm.pid, mm->oom_mm.comm);
+			set_bit(MMF_OOM_REAPED, &mm->flags);
 		}
 
 		debug_show_all_locks();
 	}
-
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
-	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
 }
 
 static int oom_reaper(void *unused)
@@ -610,49 +584,33 @@ static int oom_reaper(void *unused)
 	set_freezable();
 
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct mm_struct *mm = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
-		spin_unlock(&oom_reaper_lock);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_mm_list));
+		spin_lock(&oom_mm_list_lock);
+		if (!list_empty(&oom_mm_list))
+			mm = list_first_entry(&oom_mm_list, struct mm_struct,
+					      oom_mm.list);
+		spin_unlock(&oom_mm_list_lock);
 
-		if (tsk)
-			oom_reap_task(tsk);
+		if (!mm)
+			continue;
+		oom_reap_vma(mm);
+		exit_oom_mm(mm);
 	}
 
 	return 0;
 }
 
-void wake_oom_reaper(struct task_struct *tsk)
-{
-	if (!oom_reaper_th)
-		return;
-
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
-		return;
-
-	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
-}
-
 static int __init oom_init(void)
 {
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-	if (IS_ERR(oom_reaper_th)) {
-		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
-				PTR_ERR(oom_reaper_th));
-		oom_reaper_th = NULL;
-	}
+	/*
+	 * If this kthread_run() call triggers OOM killer, the system will
+	 * panic() because this function is called before userspace processes
+	 * are started. Thus, there is no point with checking failure.
+	 */
+	kthread_run(oom_reaper, NULL, "oom_reaper");
 	return 0;
 }
 subsys_initcall(oom_init)
@@ -665,13 +623,33 @@ subsys_initcall(oom_init)
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
-void mark_oom_victim(struct task_struct *tsk)
+void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 {
+	struct mm_struct *mm = tsk->mm;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_inc(&tsk->signal->oom_victims);
+	/*
+	 * Connect to oom_mm_list only if not connected.
+	 * Since we have stable mm, exit_oom_mm() from __mmput() can't be
+	 * called before we add this mm.
+	 */
+	spin_lock(&oom_mm_list_lock);
+	if (!mm->oom_mm.list.next) {
+		atomic_inc(&mm->mm_count);
+		mm->oom_mm.memcg = oc->memcg;
+		mm->oom_mm.nodemask = oc->nodemask;
+		strncpy(mm->oom_mm.comm, tsk->comm, sizeof(mm->oom_mm.comm));
+		mm->oom_mm.pid = task_pid_nr(tsk);
+		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
+#ifdef CONFIG_MMU
+		wake_up(&oom_reaper_wait);
+#endif
+	}
+	spin_unlock(&oom_mm_list_lock);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -685,11 +663,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
-	atomic_dec(&tsk->signal->oom_victims);
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -821,7 +797,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -829,8 +804,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
+		mark_oom_victim(p, oc);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -890,7 +864,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
+	mark_oom_victim(victim, oc);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -920,7 +894,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used. Hide the mm from the oom
 			 * killer to guarantee OOM forward progress.
 			 */
-			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
@@ -931,9 +904,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -1008,8 +978,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		mark_oom_victim(current, oc);
 		return true;
 	}
 
@@ -1040,13 +1009,16 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask))
+		return true;
+
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (p && p != (void *)-1UL) {
+	if (p) {
 		oom_kill_process(oc, p, points, totalpages, "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying
-- 
1.8.3.1



* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-30  7:59                       ` Michal Hocko
  2016-06-30 10:51                         ` Tetsuo Handa
@ 2016-07-03 13:21                         ` Oleg Nesterov
  2016-07-07 11:51                           ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Oleg Nesterov @ 2016-07-03 13:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/30, Michal Hocko wrote:
> On Wed 29-06-16 22:01:08, Oleg Nesterov wrote:
> > On 06/29, Michal Hocko wrote:
> > >
> > > > > +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
> > > > >  {
> > > > >  	WARN_ON(oom_killer_disabled);
> > > > >  	/* OOM killer might race with memcg OOM */
> > > > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > > >  		return;
> > > > > +
> > > > >  	atomic_inc(&tsk->signal->oom_victims);
> > > > > +
> > > > > +	/* oom_mm is bound to the signal struct life time */
> > > > > +	if (!tsk->signal->oom_mm) {
> > > > > +		atomic_inc(&mm->mm_count);
> > > > > +		tsk->signal->oom_mm = mm;
> > > >
> > > > Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
> > > > makes sense.
> > >
> > > mark_oom_victim will be called only for the current or under the
> > > task_lock so it should be stable. Except for...
> >
> > I meant that the code looks racy because 2 threads can see ->oom_mm == NULL
> > at the same time and in this case we have the extra atomic_inc(mm_count).
> > But I guess oom_lock saves us, so the code is correct but not clear.
>
> I have changed that to cmpxchg because lowmemory killer is called
> outside of oom_lock.

Hmm. I do not see anything in android/lowmemorykiller.c which can call
mark_oom_victim() ...

But if this is possible then perhaps we have more problems, note that the

	if (tsk == oom_reaper_list || tsk->oom_reaper_list)

check in wake_oom_reaper() looks equally racy unless tsk is always current,
since it runs without oom_lock.

And btw this check probably needs a comment too: we rely on SIGKILL being sent
to this task before we do wake_oom_reaper(), or on task_will_free_mem() == T.
Otherwise tsk->oom_reaper_list can be non-NULL if a victim forks before
exit, the child will have ->oom_reaper_list copied from parent by
dup_task_struct().

> > > Hmm, I didn't think about exec case. And I guess we have never cared
> > > about that race. We just select a task and then kill it.
> >
> > And I guess we want to fix this too, although this is not that important,
> > but this looks like a minor security problem.
>
> I am not sure I can see security implications but I agree this is less
> than optimal,

Well, just suppose that a memory hog execs a setuid application which does
something important, then we can kill it in some "inconsistent" state. Say,
after it created a file-lock which blocks other instances.

> albeit not critical. Killing a young process which didn't
> have much time to do a useful work doesn't seem that critical.

Yes, agreed, this is minor and very unlikely.

> > And this is another indication that almost everything oom-kill.c does with
> > task_struct is wrong ;) Ideally It should only use task_struct to send the
> > SIGKILL, and now that we kill all users of victim->mm we can hopefully do
> > this later.
>
> Hmm, so you think we should do s@victim@mm_victim@ and then do the
> for_each_process loop to kill all the tasks sharing that mm and kill
> them? We are doing that already so it doesn't sound that bad...

Yes, exactly. But of course I am not sure about the details.

>
> > Btw, do we still need this list_for_each_entry(child, &t->children, sibling)
> > loop in oom_kill_process() ?
>
> Well, to be honest, I don't know. This is a heuristic we have been doing
> for a long time. I do not know how many times it really matters. It can
> even be harmful in loads where children are created in the same pace OOM
> killer is killing them. Not sure how likely is that though...

And it is not clear to me why "child_points > victim_points" can be true if
the victim was chosen by select_bad_process() (to simplify the discussion,
lets ignore has_intersects_mems_allowed/etc).

> Let me think whether we can do something about that.

Perhaps it only makes sense if the caller is out_of_memory()? I mean the
sysctl_oom_kill_allocating_task branch. In this case it would be nice to move
this list_for_each_entry(children) loop into another helper.

Oleg.



* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-30  8:07                   ` Michal Hocko
@ 2016-07-03 13:24                     ` Oleg Nesterov
  0 siblings, 0 replies; 34+ messages in thread
From: Oleg Nesterov @ 2016-07-03 13:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 06/30, Michal Hocko wrote:
>
> On Wed 29-06-16 22:14:09, Oleg Nesterov wrote:
> > On 06/28, Michal Hocko wrote:
> > >
> > >
> > > Could you point me to where it depends on that? I mean if we are past
> > > exit_mm then we have unmapped the address space most probably but why
> > > should we care about that in the scheduler? There shouldn't be any
> > > further access to the address space by that point. I can see that
> > > context_switch() checks task->mm but it should just work when it sees it
> > > non NULL, right?
> >
> > But who will do the final mmdrop() then? I am not saying this is impossible
> > to change, say we do this in finish_task_switch(TASK_DEAD) or even in
> > free_task(), but we do not want this?
>
> I thought it could be done somewhere in release_task after we unhash
> the process

No, we can't do this. Note that release_task() can be called right after
exit_notify() by its parent or by the exiting thread itself. It can still
run after that and it needs ->active_mm.

Oleg.



* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-06-30 10:51                         ` Tetsuo Handa
  2016-06-30 11:21                           ` Michal Hocko
@ 2016-07-03 13:32                           ` Oleg Nesterov
  1 sibling, 0 replies; 34+ messages in thread
From: Oleg Nesterov @ 2016-07-03 13:32 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, linux-mm, vdavydov, rientjes

On 06/30, Tetsuo Handa wrote:
>
> Michal Hocko wrote:
> > On Wed 29-06-16 22:01:08, Oleg Nesterov wrote:
>
> > > Btw, do we still need this list_for_each_entry(child, &t->children, sibling)
> > > loop in oom_kill_process() ?
> >
> > Well, to be honest, I don't know. This is a heuristic we have been doing
> > for a long time. I do not know how many times it really matters. It can
> > even be harmful in loads where children are created in the same pace OOM
> > killer is killing them. Not sure how likely is that though...
> > Let me think whether we can do something about that.
>
> I'm using that behavior in order to test almost OOM situation. ;)

Can you explain why we want this behaviour?

Except, again, sysctl_oom_kill_allocating_task, see my reply to Michal.

> By the way, are you going to fix use_mm() race? Currently, we don't wake up
> OOM reaper if some kernel thread is holding a reference to that mm via
> use_mm(). But currently we can hit

Yes, I already mentioned this race, and this is why I think we should not
skip kthreads.

> race. I think we need to make use_mm() fail after mark_oom_victim() is called.

Perhaps this makes sense anyway later, but I still think we do not really
care. I'll write another email...

Oleg.



* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-07-03 13:21                         ` Oleg Nesterov
@ 2016-07-07 11:51                           ` Michal Hocko
  2016-07-07 16:42                             ` Oleg Nesterov
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2016-07-07 11:51 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On Sun 03-07-16 15:21:47, Oleg Nesterov wrote:
> On 06/30, Michal Hocko wrote:
> > On Wed 29-06-16 22:01:08, Oleg Nesterov wrote:
> > > On 06/29, Michal Hocko wrote:
> > > >
> > > > > > +void mark_oom_victim(struct task_struct *tsk, struct mm_struct *mm)
> > > > > >  {
> > > > > >  	WARN_ON(oom_killer_disabled);
> > > > > >  	/* OOM killer might race with memcg OOM */
> > > > > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > > > >  		return;
> > > > > > +
> > > > > >  	atomic_inc(&tsk->signal->oom_victims);
> > > > > > +
> > > > > > +	/* oom_mm is bound to the signal struct life time */
> > > > > > +	if (!tsk->signal->oom_mm) {
> > > > > > +		atomic_inc(&mm->mm_count);
> > > > > > +		tsk->signal->oom_mm = mm;
> > > > >
> > > > > Looks racy, but it is not because we rely on oom_lock? Perhaps a comment
> > > > > makes sense.
> > > >
> > > > mark_oom_victim will be called only for the current or under the
> > > > task_lock so it should be stable. Except for...
> > >
> > > I meant that the code looks racy because 2 threads can see ->oom_mm == NULL
> > > at the same time and in this case we have the extra atomic_inc(mm_count).
> > > But I guess oom_lock saves us, so the code is correct but not clear.
> >
> > I have changed that to cmpxchg because lowmemory killer is called
> > outside of oom_lock.
> 
> Hmm. I do not see anything in android/lowmemorykiller.c which can call
> mark_oom_victim() ...

I was just working on the pure mmotm tree and the lmk change was routed
via Greg. In short, mark_oom_victim() is no longer used outside of oom proper.

> And btw this check probably needs a comment too, we rely on SIGKILL sent
> to this task before we do wake_oom_reaper(), or task_will_free_mem() == T.
> Otherwise tsk->oom_reaper_list can be non-NULL if a victim forks before
> exit, the child will have ->oom_reaper_list copied from parent by
> dup_task_struct().

Yes, there is that dependency and it is probably worth a comment.

> > > > Hmm, I didn't think about exec case. And I guess we have never cared
> > > > about that race. We just select a task and then kill it.
> > >
> > > And I guess we want to fix this too, although this is not that important,
> > > but this looks like a minor security problem.
> >
> > I am not sure I can see security implications but I agree this is less
> > than optimal,
> 
> Well, just suppose that a memory hog execs a setuid application which does
> something important, then we can kill it in some "inconsistent" state. Say,
> after it created a file-lock which blocks other instances.

How would that differ from selecting and killing the suid application
right away?

[...]
> > > Btw, do we still need this list_for_each_entry(child, &t->children, sibling)
> > > loop in oom_kill_process() ?
> >
> > Well, to be honest, I don't know. This is a heuristic we have been doing
> > for a long time. I do not know how many times it really matters. It can
> > even be harmful in loads where children are created in the same pace OOM
> > killer is killing them. Not sure how likely is that though...
> 
> And it is not clear to me why "child_points > victim_points" can be true if
> the victim was chosen by select_bad_process() (to simplify the discussion,
> lets ignore has_intersects_mems_allowed/etc).

Because victim_points is a bit of a misnomer. It doesn't have anything to
do with the selected victim's score. victim_points is 0 before the loop.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE
  2016-07-07 11:51                           ` Michal Hocko
@ 2016-07-07 16:42                             ` Oleg Nesterov
  0 siblings, 0 replies; 34+ messages in thread
From: Oleg Nesterov @ 2016-07-07 16:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, linux-mm, vdavydov, rientjes

On 07/07, Michal Hocko wrote:
>
> On Sun 03-07-16 15:21:47, Oleg Nesterov wrote:
> > >
> > > I am not sure I can see security implications but I agree this is less
> > > than optimal,
> >
> > Well, just suppose that a memory hog execs a setuid application which does
> > something important, then we can kill it in some "inconsistent" state. Say,
> > after it created a file-lock which blocks other instances.
>
> How would that differ from selecting and killing the suid application
> right away?

In this case we at least check oom_score_adj/has_capability_noaudit(CAP_SYS_ADMIN)
before we decide to kill it.

> > And it is not clear to me why "child_points > victim_points" can be true if
> > the victim was chosen by select_bad_process() (to simplify the discussion,
> > lets ignore has_intersects_mems_allowed/etc).
>
> Because victim_points is a bit of a misnomer. It doesn't have anything to
> do with the selected victim's score; victim_points is 0 before the loop.

Ah, thanks. Yes, I misread the code.

Oleg.



Thread overview: 34+ messages
2016-06-24 11:02 [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE Tetsuo Handa
2016-06-24 12:39 ` Michal Hocko
2016-06-24 15:54   ` Tetsuo Handa
2016-06-24 22:42     ` Oleg Nesterov
2016-06-24 21:56 ` Oleg Nesterov
2016-06-25  5:44   ` Tetsuo Handa
2016-06-27  9:23     ` Michal Hocko
2016-06-27 10:36       ` Michal Hocko
2016-06-27 15:51         ` Oleg Nesterov
2016-06-27 16:06           ` Michal Hocko
2016-06-27 17:55             ` Oleg Nesterov
2016-06-28 10:19               ` Michal Hocko
2016-06-29  0:13                 ` Oleg Nesterov
2016-06-29  8:33                   ` Michal Hocko
2016-06-29 14:19                     ` Michal Hocko
2016-07-01 10:15                       ` Tetsuo Handa
2016-06-29 20:01                     ` Oleg Nesterov
2016-06-30  7:59                       ` Michal Hocko
2016-06-30 10:51                         ` Tetsuo Handa
2016-06-30 11:21                           ` Michal Hocko
2016-07-03 13:32                           ` Oleg Nesterov
2016-07-03 13:21                         ` Oleg Nesterov
2016-07-07 11:51                           ` Michal Hocko
2016-07-07 16:42                             ` Oleg Nesterov
2016-06-29 20:14                 ` Oleg Nesterov
2016-06-30  8:07                   ` Michal Hocko
2016-07-03 13:24                     ` Oleg Nesterov
2016-06-27 21:09       ` Oleg Nesterov
2016-06-28 10:26         ` Michal Hocko
2016-06-29 19:34           ` Oleg Nesterov
2016-06-27 20:40     ` Oleg Nesterov
2016-06-28 10:29       ` Michal Hocko
2016-06-29 20:24         ` Oleg Nesterov
2016-06-30  8:16           ` Michal Hocko
