* [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
@ 2016-04-14 10:56 Tetsuo Handa
  2016-04-14 10:56 ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory Tetsuo Handa
  2016-04-14 11:21 ` [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-14 10:56 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Oleg Nesterov, David Rientjes, linux-mm, Tetsuo Handa

Assuming that try_oom_reaper() is correctly implemented, we should use
try_oom_reaper() to test "whether the OOM reaper is allowed to reap the
OOM victim's memory" rather than "whether the OOM killer is allowed to
send SIGKILL to thread groups sharing the OOM victim's memory", because
the OOM reaper is still allowed to reap the OOM victim's memory even if
that memory is shared by OOM_SCORE_ADJ_MIN thread groups which are
already killed or exiting.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 23 +++++++----------------
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7098104..e78818d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -648,10 +648,6 @@ subsys_initcall(oom_init)
 static void try_oom_reaper(struct task_struct *tsk)
 {
 }
-
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -741,7 +737,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -833,22 +828,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		}
+		if (is_global_init(p))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	try_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

* [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory.
  2016-04-14 10:56 [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Tetsuo Handa
@ 2016-04-14 10:56 ` Tetsuo Handa
  2016-04-14 11:31   ` Michal Hocko
  2016-04-14 11:21 ` [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-14 10:56 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Oleg Nesterov, David Rientjes, linux-mm, Tetsuo Handa

The current comment for "Kill all user processes sharing victim->mm in
other thread groups" does not make it clear that doing so is a
best-effort avoidance.

I have tried several times to update that logic along with TIF_MEMDIE,
but those patches have not been accepted. Therefore, this patch changes
only the comment so that it can be applied now.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e78818d..43d0002 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -814,13 +814,28 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_unlock(victim);
 
 	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
+	 * Kill all user processes sharing victim->mm in other thread groups,
+	 * if any. This reduces possibility of hitting mm->mmap_sem livelock
+	 * when an OOM victim thread cannot exit because it requires the
+	 * mm->mmap_sem for read at exit_mm() while another thread is trying
+	 * to allocate memory with that mm->mmap_sem held for write.
+	 *
+	 * Any thread except the victim thread itself which is killed by
+	 * this heuristic does not get access to memory reserves as of now,
+	 * but it will get access to memory reserves by calling out_of_memory()
+	 * or mem_cgroup_out_of_memory() since it has a pending fatal signal.
+	 *
+	 * Note that this heuristic is not perfect because it is possible that
+	 * a thread which shares victim->mm and is doing memory allocation with
+	 * victim->mm->mmap_sem held for write is marked as OOM_SCORE_ADJ_MIN.
+	 * Also, it is possible that a thread which shares victim->mm and is
+	 * doing memory allocation with victim->mm->mmap_sem held for write
+	 * (possibly the victim thread itself which got TIF_MEMDIE) is blocked
+	 * at unkillable locks from direct reclaim paths because nothing
+	 * prevents TIF_MEMDIE threads which already started direct reclaim
+	 * paths from being blocked at unkillable locks. In such cases, the
+	 * OOM reaper will be unable to reap victim->mm and we will need to
+	 * select a different OOM victim.
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
-- 
1.8.3.1

* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 10:56 [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Tetsuo Handa
  2016-04-14 10:56 ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory Tetsuo Handa
@ 2016-04-14 11:21 ` Michal Hocko
  2016-04-14 11:34   ` Tetsuo Handa
  2016-04-15 12:11   ` Tetsuo Handa
  1 sibling, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2016-04-14 11:21 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Oleg Nesterov, David Rientjes, linux-mm

On Thu 14-04-16 19:56:30, Tetsuo Handa wrote:
> Assuming that try_oom_reaper() is correctly implemented, we should use
> try_oom_reaper() to test "whether the OOM reaper is allowed to reap the
> OOM victim's memory" rather than "whether the OOM killer is allowed to
> send SIGKILL to thread groups sharing the OOM victim's memory", because
> the OOM reaper is still allowed to reap the OOM victim's memory even if
> that memory is shared by OOM_SCORE_ADJ_MIN thread groups which are
> already killed or exiting.

So you prefer to crawl over the whole task list again just to catch a
really unlikely case where the OOM_SCORE_ADJ_MIN mm sharing task was
already exiting? Under which workload does this matter?

The patch seems correct I just do not see any point in it because I do
not think it handles any real life situation. I basically consider any
workload where only _certain_ thread(s) or process(es) sharing the mm have
OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
requires root to cripple the system. Or am I missing a valid
configuration where this would make any sense?

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 23 +++++++----------------
>  1 file changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7098104..e78818d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -648,10 +648,6 @@ subsys_initcall(oom_init)
>  static void try_oom_reaper(struct task_struct *tsk)
>  {
>  }
> -
> -static void wake_oom_reaper(struct task_struct *tsk)
> -{
> -}
>  #endif
>  
>  /**
> @@ -741,7 +737,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> -	bool can_oom_reap = true;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -833,22 +828,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> -			/*
> -			 * We cannot use oom_reaper for the mm shared by this
> -			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> -			 */
> -			can_oom_reap = false;
> +		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
> -		}
> +		if (is_global_init(p))
> +			continue;
> +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +			continue;
> +
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +	try_oom_reaper(victim);
>  
>  	mmdrop(mm);
>  	put_task_struct(victim);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory.
  2016-04-14 10:56 ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory Tetsuo Handa
@ 2016-04-14 11:31   ` Michal Hocko
  2016-04-14 15:03     ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory Tetsuo Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-04-14 11:31 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Oleg Nesterov, David Rientjes, linux-mm

On Thu 14-04-16 19:56:31, Tetsuo Handa wrote:
> > The current comment for "Kill all user processes sharing victim->mm in
> > other thread groups" does not make it clear that doing so is a
> > best-effort avoidance.
> > 
> > I have tried several times to update that logic along with TIF_MEMDIE,
> > but those patches have not been accepted. Therefore, this patch changes
> > only the comment so that it can be applied now.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 29 ++++++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index e78818d..43d0002 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -814,13 +814,28 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	task_unlock(victim);
>  
>  	/*
> -	 * Kill all user processes sharing victim->mm in other thread groups, if
> -	 * any.  They don't get access to memory reserves, though, to avoid
> -	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
           ^^^^^^^^^
this was useful information which you have dropped. Why?

> -	 * oom killed thread cannot exit because it requires the semaphore and
> -	 * its contended by another thread trying to allocate memory itself.
> -	 * That thread will now get access to memory reserves since it has a
> -	 * pending fatal signal.
> +	 * Kill all user processes sharing victim->mm in other thread groups,
> +	 * if any. This reduces possibility of hitting mm->mmap_sem livelock
> +	 * when an OOM victim thread cannot exit because it requires the
> +	 * mm->mmap_sem for read at exit_mm() while another thread is trying
> +	 * to allocate memory with that mm->mmap_sem held for write.
> +	 *
> +	 * Any thread except the victim thread itself which is killed by
> +	 * this heuristic does not get access to memory reserves as of now,
> +	 * but it will get access to memory reserves by calling out_of_memory()
> +	 * or mem_cgroup_out_of_memory() since it has a pending fatal signal.
> +	 *
> +	 * Note that this heuristic is not perfect because it is possible that
> +	 * a thread which shares victim->mm and is doing memory allocation with
> +	 * victim->mm->mmap_sem held for write is marked as OOM_SCORE_ADJ_MIN.

Is this really helpful? I would rather be explicit that we _do not care_
about these configurations. It is just a PITA to maintain and it doesn't
make any sense. So rather than trying to document all the weird things
that might happen I would welcome a warning: "mm shared with
OOM_SCORE_ADJ_MIN task. Something is broken in your configuration!"

> +	 * Also, it is possible that a thread which shares victim->mm and is
> +	 * doing memory allocation with victim->mm->mmap_sem held for write
> +	 * (possibly the victim thread itself which got TIF_MEMDIE) is blocked
> +	 * at unkillable locks from direct reclaim paths because nothing
> +	 * prevents TIF_MEMDIE threads which already started direct reclaim
> +	 * paths from being blocked at unkillable locks. In such cases, the
> +	 * OOM reaper will be unable to reap victim->mm and we will need to
> +	 * select a different OOM victim.

This is a more general problem and not related to this particular code.
Whenever we select a victim and call mark_oom_victim we hope it will
eventually get out of its kernel code path (unless it was running in the
userspace) so I am not sure this is placed properly.

>  	 */
>  	rcu_read_lock();
>  	for_each_process(p) {
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 11:21 ` [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Michal Hocko
@ 2016-04-14 11:34   ` Tetsuo Handa
  2016-04-14 12:01     ` Michal Hocko
  2016-04-15 12:11   ` Tetsuo Handa
  1 sibling, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-14 11:34 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, oleg, rientjes, linux-mm

Michal Hocko wrote:
> On Thu 14-04-16 19:56:30, Tetsuo Handa wrote:
> > Assuming that try_oom_reaper() is correctly implemented, we should use
> > try_oom_reaper() to test "whether the OOM reaper is allowed to reap the
> > OOM victim's memory" rather than "whether the OOM killer is allowed to
> > send SIGKILL to thread groups sharing the OOM victim's memory", because
> > the OOM reaper is still allowed to reap the OOM victim's memory even if
> > that memory is shared by OOM_SCORE_ADJ_MIN thread groups which are
> > already killed or exiting.
> 
> So you prefer to crawl over the whole task list again just to catch a
> really unlikely case where the OOM_SCORE_ADJ_MIN mm sharing task was
> already exiting? Under which workload does this matter?
> 
> The patch seems correct I just do not see any point in it because I do
> not think it handles any real life situation. I basically consider any
> workload where only _certain_ thread(s) or process(es) sharing the mm have
> OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
> requires root to cripple the system. Or am I missing a valid
> configuration where this would make any sense?

Because __oom_reap_task() as of current linux.git marks only one of the
thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping
(I am utilizing this behavior for catching bugs which occur under
almost-OOM situations).

* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 11:34   ` Tetsuo Handa
@ 2016-04-14 12:01     ` Michal Hocko
  2016-04-14 12:34       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-04-14 12:01 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, rientjes, linux-mm

On Thu 14-04-16 20:34:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > The patch seems correct I just do not see any point in it because I do
> > not think it handles any real life situation. I basically consider any
> > workload where only _certain_ thread(s) or process(es) sharing the mm have
> > OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
> > requires root to cripple the system. Or am I missing a valid
> > configuration where this would make any sense?
> 
> Because __oom_reap_task() as of current linux.git marks only one of the
> thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping
> (I am utilizing this behavior for catching bugs which occur under
> almost-OOM situations).

I am not really sure I understand what you mean here. Let me try. You
have N tasks sharing the same mm. OOM killer selects one of them and
kills it, grants TIF_MEMDIE and schedules it for oom_reaper. Now the oom
reaper handles that task and marks it OOM_SCORE_ADJ_MIN. Others will
have fatal_signal_pending without OOM_SCORE_ADJ_MIN. The shared mm was
already reaped so there is not much left we can do about it. What now?

A different question is whether it makes any sense to pick a task with
an oom-reaped mm as a new victim. This would happen if either not much
of the memory is reapable or the mm was quite small. I agree that we do
not handle this case now, same as we haven't before. An mm-specific flag
would handle that, I believe. Something like the following. Is this what
you are worried about, or am I still missing your point?
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..7bd0fa9db199 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_OOM_REAPED		21	/* mm has been already reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 716759e3eaab..d5a4d08f2031 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		return OOM_SCAN_CONTINUE;
 
 	/*
+	 * mm of this task has already been reaped so it doesn't make any
+	 * sense to select it as a new oom victim.
+	 */
+	if (test_bit(MMF_OOM_REAPED, &task->mm->flags))
+		return OOM_SCAN_CONTINUE;
+
+	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
@@ -513,7 +520,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * This task can be safely ignored because we cannot do much more
 	 * to release its memory.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	set_bit(MMF_OOM_REAPED, &mm->flags);
 out:
 	mmput(mm);
 	return ret;
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 12:01     ` Michal Hocko
@ 2016-04-14 12:34       ` Michal Hocko
  2016-04-14 14:01         ` Tetsuo Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-04-14 12:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, rientjes, linux-mm

On Thu 14-04-16 14:01:06, Michal Hocko wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 716759e3eaab..d5a4d08f2031 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  		return OOM_SCAN_CONTINUE;
>  
>  	/*
> +	 * mm of this task has already been reaped so it doesn't make any
> +	 * sense to select it as a new oom victim.
> +	 */
> +	if (test_bit(MMF_OOM_REAPED, &task->mm->flags))
> +		return OOM_SCAN_CONTINUE;

This will have to move to oom_badness to where we check for
OOM_SCORE_ADJ_MIN to catch the case where we try to sacrifice a child...

In the meantime I have generated a full patch and will repost it with
other oom reaper follow-ups sometime next week.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 12:34       ` Michal Hocko
@ 2016-04-14 14:01         ` Tetsuo Handa
  2016-04-14 14:30           ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-14 14:01 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, oleg, rientjes, linux-mm

Michal Hocko wrote:
> On Thu 14-04-16 20:34:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > The patch seems correct I just do not see any point in it because I do
> > > not think it handles any real life situation. I basically consider any
> > > workload where only _certain_ thread(s) or process(es) sharing the mm have
> > > OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
> > > requires root to cripple the system. Or am I missing a valid
> > > configuration where this would make any sense?
> > 
> > Because __oom_reap_task() as of current linux.git marks only one of the
> > thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping
> > (I am utilizing this behavior for catching bugs which occur under
> > almost-OOM situations).
> 
> I am not really sure I understand what you mean here. Let me try. You
> have N tasks sharing the same mm. OOM killer selects one of them and
> kills it, grants TIF_MEMDIE and schedules it for oom_reaper. Now the oom
> reaper handles that task and marks it OOM_SCORE_ADJ_MIN. Others will
> have fatal_signal_pending without OOM_SCORE_ADJ_MIN. The shared mm was
> already reaped so there is not much left we can do about it. What now?

You finally understood what I mean here.

Say, there are TG1 and TG2 sharing the same mm which are marked as
OOM_SCORE_ADJ_MAX. First round of the OOM killer selects TG1 and sends
SIGKILL to TG1 and TG2. The OOM reaper reaps memory via TG1 and marks
TG1 as OOM_SCORE_ADJ_MIN and revokes TIF_MEMDIE from TG1. Then, next
round of the OOM killer selects TG2 and sends SIGKILL to TG1 and TG2.
But since TG1 is already marked as OOM_SCORE_ADJ_MIN by the OOM reaper,
the OOM reaper is not called.

This is a situation which the patch you show below will solve.

> 
> A different question is whether it makes any sense to pick a task with
> an oom-reaped mm as a new victim. This would happen if either not much
> of the memory is reapable or the mm was quite small. I agree that we do
> not handle this case now, same as we haven't before. An mm-specific flag
> would handle that, I believe. Something like the following. Is this what
> you are worried about, or am I still missing your point?
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index acfc32b30704..7bd0fa9db199 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
> +#define MMF_OOM_REAPED		21	/* mm has been already reaped */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 716759e3eaab..d5a4d08f2031 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  		return OOM_SCAN_CONTINUE;
>  
>  	/*
> +	 * mm of this task has already been reaped so it doesn't make any
> +	 * sense to select it as a new oom victim.
> +	 */
> +	if (test_bit(MMF_OOM_REAPED, &task->mm->flags))

You checked for task->mm != NULL on the previous line, but nothing
prevents that task from setting task->mm = NULL before we arrive at
this line.

> +		return OOM_SCAN_CONTINUE;
> +
> +	/*
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
>  	 */
> @@ -513,7 +520,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 * This task can be safely ignored because we cannot do much more
>  	 * to release its memory.
>  	 */
> -	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> +	set_bit(MMF_OOM_REAPED, &mm->flags);
>  out:
>  	mmput(mm);
>  	return ret;

Michal Hocko wrote:
> On Thu 14-04-16 14:01:06, Michal Hocko wrote:
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 716759e3eaab..d5a4d08f2031 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  		return OOM_SCAN_CONTINUE;
> >  
> >  	/*
> > +	 * mm of this task has already been reaped so it doesn't make any
> > +	 * sense to select it as a new oom victim.
> > +	 */
> > +	if (test_bit(MMF_OOM_REAPED, &task->mm->flags))
> > +		return OOM_SCAN_CONTINUE;
> 
> This will have to move to oom_badness to where we check for
> OOM_SCORE_ADJ_MIN to catch the case where we try to sacrifice a child...

oom_badness() should return 0 if MMF_OOM_REAPED is set (please be careful
with the race of task->mm becoming NULL). But oom_scan_process_thread() should not
return OOM_SCAN_ABORT if one of threads in TG1 or TG2 still has TIF_MEMDIE
(because it is possible that one of threads in TG1 or TG2 gets TIF_MEMDIE
via the fatal_signal_pending(current) shortcut in out_of_memory()).

> 
> In the meantime I have generated a full patch and will repost it with
> other oom reaper follow ups sometimes next week.

* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 14:01         ` Tetsuo Handa
@ 2016-04-14 14:30           ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-04-14 14:30 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, rientjes, linux-mm

On Thu 14-04-16 23:01:41, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 14-04-16 20:34:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > The patch seems correct I just do not see any point in it because I do
> > > > not think it handles any real life situation. I basically consider any
> > > > workload where only _certain_ thread(s) or process(es) sharing the mm have
> > > > OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
> > > > requires root to cripple the system. Or am I missing a valid
> > > > configuration where this would make any sense?
> > > 
> > > Because __oom_reap_task() as of current linux.git marks only one of the
> > > thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping
> > > (I am utilizing this behavior for catching bugs which occur under
> > > almost-OOM situations).
> > 
> > I am not really sure I understand what you mean here. Let me try. You
> > have N tasks sharing the same mm. OOM killer selects one of them and
> > kills it, grants TIF_MEMDIE and schedules it for oom_reaper. Now the oom
> > reaper handles that task and marks it OOM_SCORE_ADJ_MIN. Others will
> > have fatal_signal_pending without OOM_SCORE_ADJ_MIN. The shared mm was
> > already reaped so there is not much left we can do about it. What now?
> 
> You finally understood what I mean here.

OK, good to know we are on the same page.

> Say, there are TG1 and TG2 sharing the same mm which are marked as
> OOM_SCORE_ADJ_MAX. First round of the OOM killer selects TG1 and sends
> SIGKILL to TG1 and TG2. The OOM reaper reaps memory via TG1 and marks
> TG1 as OOM_SCORE_ADJ_MIN and revokes TIF_MEMDIE from TG1. Then, next
> round of the OOM killer selects TG2 and sends SIGKILL to TG1 and TG2.
> But since TG1 is already marked as OOM_SCORE_ADJ_MIN by the OOM reaper,
> the OOM reaper is not called.

which doesn't matter because this mm has already been reaped and further
attempts are basically doomed to fail as well. This is the reason why I
completely failed to see your point previously: it is not the oom reaper
which makes the situation worse; we just never cared about this possible
case.

> This is a situation which the patch you show below will solve.

OK, great.

[...]

> Michal Hocko wrote:
> > On Thu 14-04-16 14:01:06, Michal Hocko wrote:
> > [...]
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 716759e3eaab..d5a4d08f2031 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> > >  		return OOM_SCAN_CONTINUE;
> > >  
> > >  	/*
> > > +	 * mm of this task has already been reaped so it doesn't make any
> > > +	 * sense to select it as a new oom victim.
> > > +	 */
> > > +	if (test_bit(MMF_OOM_REAPED, &task->mm->flags))
> > > +		return OOM_SCAN_CONTINUE;
> > 
> > This will have to move to oom_badness to where we check for
> > OOM_SCORE_ADJ_MIN to catch the case where we try to sacrifice a child...
> 
> oom_badness() should return 0 if MMF_OOM_REAPED is set (please be careful
> with the race of task->mm becoming NULL).

This is what I ended up with; patch below for reference. I plan to
repost it together with the other 3 posted recently in one series
sometime next week, hopefully.

> But oom_scan_process_thread() should not
> return OOM_SCAN_ABORT if one of threads in TG1 or TG2 still has TIF_MEMDIE
> (because it is possible that one of threads in TG1 or TG2 gets TIF_MEMDIE
> via the fatal_signal_pending(current) shortcut in out_of_memory()).

This would be a separate patch again. I still have to think about how to
deal with this case, but the most straightforward thing to do would be to
simply disable those shortcuts for cross-process shared mm-s. They are
just too weird and I do not think we want to support all the potential
corner cases, and dropping an optimistic heuristic in the name of overall
sanity sounds like a good compromise to me.

---

* Re: [PATCH] mm,oom: Clarify reason to kill other threads sharing the victim's memory.
  2016-04-14 11:31   ` Michal Hocko
@ 2016-04-14 15:03     ` Tetsuo Handa
  2016-04-14 15:18       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-14 15:03 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, oleg, rientjes, linux-mm

Michal Hocko wrote:
> On Thu 14-04-16 19:56:31, Tetsuo Handa wrote:
> > The current comment for "Kill all user processes sharing victim->mm in
> > other thread groups" does not make it clear that doing so is a
> > best-effort avoidance.
> > 
> > I tried to update that logic along with TIF_MEMDIE several times, but the
> > patches were not accepted. Therefore, this patch changes only the comment
> > so that it can be applied now.
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  mm/oom_kill.c | 29 ++++++++++++++++++++++-------
> >  1 file changed, 22 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index e78818d..43d0002 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -814,13 +814,28 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	task_unlock(victim);
> >  
> >  	/*
> > -	 * Kill all user processes sharing victim->mm in other thread groups, if
> > -	 * any.  They don't get access to memory reserves, though, to avoid
> > -	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
>            ^^^^^^^^^
> this was useful information which you have dropped. Why?
> 

Because I don't think setting TIF_MEMDIE to all threads sharing the victim's
memory at oom_kill_process() increases the risk of depleting the memory
reserves, for TIF_MEMDIE helps only if that thread is doing memory allocation.
I explained it at
http://lkml.kernel.org/r/201603162016.EBJ05275.VHMFSOLJOFQtOF@I-love.SAKURA.ne.jp .

> > -	 * oom killed thread cannot exit because it requires the semaphore and
> > -	 * its contended by another thread trying to allocate memory itself.
> > -	 * That thread will now get access to memory reserves since it has a
> > -	 * pending fatal signal.
> > +	 * Kill all user processes sharing victim->mm in other thread groups,
> > +	 * if any. This reduces possibility of hitting mm->mmap_sem livelock
> > +	 * when an OOM victim thread cannot exit because it requires the
> > +	 * mm->mmap_sem for read at exit_mm() while another thread is trying
> > +	 * to allocate memory with that mm->mmap_sem held for write.
> > +	 *
> > +	 * Any thread except the victim thread itself which is killed by
> > +	 * this heuristic does not get access to memory reserves as of now,
> > +	 * but it will get access to memory reserves by calling out_of_memory()
> > +	 * or mem_cgroup_out_of_memory() since it has a pending fatal signal.
> > +	 *
> > +	 * Note that this heuristic is not perfect because it is possible that
> > +	 * a thread which shares victim->mm and is doing memory allocation with
> > +	 * victim->mm->mmap_sem held for write is marked as OOM_SCORE_ADJ_MIN.
> 
> Is this really helpful? I would rather be explicit that we _do not care_
> about these configurations. It is just a PITA to maintain and it doesn't make
> any sense. So rather than trying to document all the weird things that
> might happen I would welcome a warning "mm shared with OOM_SCORE_ADJ_MIN
> task. Something is broken in your configuration!"

Would you please stop rejecting configurations which do not match your values?
The OOM killer provides a safety net against accidental memory usage.
A properly configured system should not call out_of_memory() in the first place.
Systems you call properly configured should use panic_on_oom > 0.
What I'm asking for is a workaround for rescuing current users from unexplained
silent hangups.

> 
> > +	 * Also, it is possible that a thread which shares victim->mm and is
> > +	 * doing memory allocation with victim->mm->mmap_sem held for write
> > +	 * (possibly the victim thread itself which got TIF_MEMDIE) is blocked
> > +	 * at unkillable locks from direct reclaim paths because nothing
> > +	 * prevents TIF_MEMDIE threads which already started direct reclaim
> > +	 * paths from being blocked at unkillable locks. In such cases, the
> > +	 * OOM reaper will be unable to reap victim->mm and we will need to
> > +	 * select a different OOM victim.
> 
> This is a more general problem and not related to this particular code.
> Whenever we select a victim and call mark_oom_victim we hope it will
> eventually get out of its kernel code path (unless it was running in the
> userspace) so I am not sure this is placed properly.

To be able to act as a safety net, we should not ignore corner cases.
Please explain your approach for handling the slowpath.

> 
> >  	 */
> >  	rcu_read_lock();
> >  	for_each_process(p) {
> > -- 
> > 1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory.
  2016-04-14 15:03     ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory Tetsuo Handa
@ 2016-04-14 15:18       ` Michal Hocko
  2016-04-14 21:59         ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory Tetsuo Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-04-14 15:18 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, rientjes, linux-mm

On Fri 15-04-16 00:03:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > I would rather be explicit that we _do not care_
> > about these configurations. It is just a PITA to maintain and it doesn't make
> > any sense. So rather than trying to document all the weird things that
> > might happen I would welcome a warning "mm shared with OOM_SCORE_ADJ_MIN
> > task. Something is broken in your configuration!"
> 
> Would you please stop rejecting configurations which do not match your values?

Can you point out a single real life example where the above
configuration would make sense? This is not about _my_ values. This is
about general _sanity_. If two or more entities share the mm and they disagree
about their OOM priorities then something is clearly broken. Don't you think?
How can the OOM killer do anything sensible here? The API we have
created is broken because it allows broken configurations too easily. It
is too late to fix it though so we can only rely on admins to use it
sensibly.

So please try to step back and think about whether it actually makes
sense to make the OOM killer even more complex/confusing for something that
makes little (if any) sense.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory.
  2016-04-14 15:18       ` Michal Hocko
@ 2016-04-14 21:59         ` Tetsuo Handa
  0 siblings, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-14 21:59 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, oleg, rientjes, linux-mm

Michal Hocko wrote:
> On Fri 15-04-16 00:03:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > I would rather be explicit that we _do not care_
> > > about these configurations. It is just a PITA to maintain and it doesn't make
> > > any sense. So rather than trying to document all the weird things that
> > > might happen I would welcome a warning "mm shared with OOM_SCORE_ADJ_MIN
> > > task. Something is broken in your configuration!"
> > 
> > Would you please stop rejecting configurations which do not match your values?
> 
> Can you point out a single real life example where the above
> configuration would make sense? This is not about _my_ values. This is
> about general _sanity_. If two or more entities share the mm and they disagree
> about their OOM priorities then something is clearly broken. Don't you think?
> How can the OOM killer do anything sensible here? The API we have
> created is broken because it allows broken configurations too easily. It
> is too late to fix it though so we can only rely on admins to use it
> sensibly.

I explained it at http://lkml.kernel.org/r/201603152015.JAE86937.VFOLtQFOFJOSHM@I-love.SAKURA.ne.jp .
That I don't use such a configuration does not mean nobody does.

> 
> So please try to step back and think about whether it actually makes
> sense to make the OOM killer even more complex/confusing for something that
> makes little (if any) sense.

Syscalls respond with "your usage is invalid" (by returning -EINVAL)
rather than "ignore such usage and crash" (by triggering a kernel panic).
Why can't the OOM killer respond with "I need to kill a different victim"
rather than "ignore and silently hang up"? Doing so with a bounded wait is
trivial, and it also helps with the "Whenever we select a victim and call
mark_oom_victim we hope it will eventually get out of its kernel code path"
problem.


* Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.
  2016-04-14 11:21 ` [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Michal Hocko
  2016-04-14 11:34   ` Tetsuo Handa
@ 2016-04-15 12:11   ` Tetsuo Handa
  1 sibling, 0 replies; 13+ messages in thread
From: Tetsuo Handa @ 2016-04-15 12:11 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, oleg, rientjes, linux-mm

Michal Hocko wrote:
> On Thu 14-04-16 19:56:30, Tetsuo Handa wrote:
> > Assuming that try_oom_reaper() is correctly implemented, we should use
> > try_oom_reaper() for testing "whether the OOM reaper is allowed to reap
> > the OOM victim's memory" rather than "whether the OOM killer is allowed
> > to send SIGKILL to thread groups sharing the OOM victim's memory",
> > for the OOM reaper is allowed to reap the OOM victim's memory even if
> > that memory is shared by OOM_SCORE_ADJ_MIN but already-killed-or-exiting
> > thread groups.
>
> So you prefer to crawl over the whole task list again just to catch a
> really unlikely case where the OOM_SCORE_ADJ_MIN mm sharing task was
> already exiting? Under which workload does this matter?
>
> The patch seems correct I just do not see any point in it because I do
> not think it handles any real life situation. I basically consider any
> workload where only _certain_ thread(s) or process(es) sharing the mm have
> OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
> requires root to cripple the system. Or am I missing a valid
> configuration where this would make any sense?

That you think this patch seems correct is sufficient. This patch is just
a preparation for applying the patches shown at the bottom on top of this
patch.

Quoting from http://lkml.kernel.org/r/20160414113108.GE2850@dhcp22.suse.cz :
| > +	 * Also, it is possible that a thread which shares victim->mm and is
| > +	 * doing memory allocation with victim->mm->mmap_sem held for write
| > +	 * (possibly the victim thread itself which got TIF_MEMDIE) is blocked
| > +	 * at unkillable locks from direct reclaim paths because nothing
| > +	 * prevents TIF_MEMDIE threads which already started direct reclaim
| > +	 * paths from being blocked at unkillable locks. In such cases, the
| > +	 * OOM reaper will be unable to reap victim->mm and we will need to
| > +	 * select a different OOM victim.
|
| This is a more general problem and not related to this particular code.
| Whenever we select a victim and call mark_oom_victim we hope it will
| eventually get out of its kernel code path (unless it was running in the
| userspace) so I am not sure this is placed properly.

So, can we set aside the "workloads where only _certain_ thread(s) or
process(es) sharing the mm have OOM_SCORE_ADJ_MIN set are invalid"
discussion?

Let's consider how to handle "TIF_MEMDIE thread can fall into unkillable
waits inside the direct reclaim paths while allocating memory with mmap_sem
held for write" problem.

Quoting from http://lkml.kernel.org/r/1459951996-12875-3-git-send-email-mhocko@kernel.org :
| If either the current task is already killed or PF_EXITING or a selected
| task is PF_EXITING then the oom killer is suppressed and so is the oom
| reaper. This patch adds try_oom_reaper which checks the given task
| and queues it for the oom reaper if that is safe to be done meaning
| that the task doesn't share the mm with an alive process.

So, you proposed making the OOM killer try to wake up the OOM reaper
whenever TIF_MEMDIE is set.

Quoting from http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org :
| The check can still do better though. We shouldn't consider the task
| unless the whole thread group is going down. This is rather unlikely
| but not impossible. A single exiting thread would surely leave all the
| address space behind. If we are really unlucky it might get stuck on the
| exit path and keep its TIF_MEMDIE and so block the oom killer.

So, you proposed making sure that we don't use the task_will_free_mem()
shortcut unless all threads in that thread group are going down. This change
allows us to mark such a thread group as no longer suitable for OOM victims
when the OOM killer cannot make forward progress (rather than marking only
that TIF_MEMDIE thread as no longer suitable), because you proposed making
the OOM killer try to wake up the OOM reaper whenever TIF_MEMDIE is set.

Quoting from http://lkml.kernel.org/r/201604082019.EDH52671.OJHQFMStOFLVOF@I-love.SAKURA.ne.jp :
| My perspective on the OOM reaper is that it should behave as a guaranteed
| unlocking mechanism rather than as a preemptive memory reaping mechanism.
| Therefore, it is critically important that the OOM reaper kernel thread is
| always woken up and unlocks TIF_MEMDIE some time later, even if it is known
| that the memory used by the caller of try_oom_reaper() is not reapable.

So, I propose making sure that we use a bounded wait for TIF_MEMDIE threads.
And now, all the parts for providing a guaranteed unlocking mechanism which
handles the "TIF_MEMDIE thread can fall into unkillable waits inside the
direct reclaim paths while allocating memory with mmap_sem held for write"
problem are ready. The patches below implement that mechanism.

Since this approach is much more deterministic than a timer-based approach,
I do hope you will accept this approach for handling the slowpath.
------------------------------------------------------------
>From 2d4fd6475ac9a0cb64855a7d6e1cf8a02569d8e8 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 15 Apr 2016 13:11:24 +0900
Subject: [PATCH] mm,oom_reaper: Use set_bit() for setting MMF_OOM_REAPED

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ecbad1e..35158b7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -518,7 +518,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * This task can be safely ignored because we cannot do much more
 	 * to release its memory.
 	 */
-	test_bit(MMF_OOM_REAPED, &mm->flags);
+	set_bit(MMF_OOM_REAPED, &mm->flags);
 out:
 	mmput(mm);
 	return ret;
-- 
1.8.3.1
------------------------------------------------------------
>From c38c5abf3431ebefbd945bb4cfb6067678b266c7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 15 Apr 2016 13:07:56 +0900
Subject: [PATCH] mm,oom_reaper: Defer reapability test to the OOM reaper.

This patch defers testing whether the OOM victim's memory is reapable
till the OOM reaper successfully takes the victim's mmap_sem for reading
because

  The majority of out_of_memory() calls are expected to terminate the
  victim processes within a second without using the OOM reaper. The
  OOM reaper needs to take over the role of making forward progress
  only when out_of_memory() cannot terminate the victim processes.
  Therefore, in most cases, we don't need to test whether the OOM
  victim's memory is reapable at oom_kill_process().

and

  But the OOM reaper cannot reap the victim's memory if mmap_sem is
  held for write. Therefore, we don't need to test whether the OOM
  victim's memory is reapable unless the OOM reaper can successfully
  take the victim's mmap_sem for reading.

Once the OOM reaper successfully takes the victim's mmap_sem for
reading, it tests whether the OOM victim's memory is reapable, and
reaps that memory only if it is reapable.

By waking up the OOM reaper from mark_oom_victim(), it is guaranteed
that the OOM reaper is woken up whenever TIF_MEMDIE is set. And unless
the OOM reaper is blocked forever inside __oom_reap_task(), it is also
guaranteed that the OOM reaper can take the next action for making
forward progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c1e2816..ecbad1e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -446,6 +446,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+static bool can_oom_reap(struct mm_struct *mm);
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -480,6 +481,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		goto out;
 	}
 
+	if (!can_oom_reap(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto out;
+	}
+
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (is_vm_hugetlb_page(vma))
@@ -594,27 +601,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
 /* Check if we can reap the given task. This has to be called with stable
  * tsk->mm
  */
-static void try_oom_reaper(struct task_struct *tsk)
+static bool can_oom_reap(struct mm_struct *mm)
 {
-	struct mm_struct *mm = tsk->mm;
 	struct task_struct *p;
 
-	if (!mm)
-		return;
-
 	/*
 	 * There might be other threads/processes which are either not
 	 * dying or even not killable.
 	 */
-	if (atomic_read(&mm->mm_users) > 1) {
+	if (atomic_read(&mm->mm_users) > 2) {
 		rcu_read_lock();
 		for_each_process(p) {
 			bool exiting;
 
 			if (!process_shares_mm(p, mm))
 				continue;
-			if (same_thread_group(p, tsk))
-				continue;
 			if (fatal_signal_pending(p))
 				continue;
 
@@ -630,12 +631,11 @@ static void try_oom_reaper(struct task_struct *tsk)
 
 			/* Give up */
 			rcu_read_unlock();
-			return;
+			return false;
 		}
 		rcu_read_unlock();
 	}
-
-	wake_oom_reaper(tsk);
+	return true;
 }
 
 static int __init oom_init(void)
@@ -649,10 +649,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static void try_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif
 
 /**
@@ -676,6 +672,9 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+#ifdef CONFIG_MMU
+	wake_oom_reaper(tsk);
+#endif
 }
 
 /**
@@ -750,7 +749,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		try_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -844,8 +842,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	try_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -926,7 +922,6 @@ bool out_of_memory(struct oom_control *oc)
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
 		return true;
 	}
 
-- 
1.8.3.1
------------------------------------------------------------
>From 00864f683d75fe1b125edb976b8563d67a85902b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 15 Apr 2016 13:14:17 +0900
Subject: [PATCH] mm,oom_reaper: Do not allow selecting the same victim forever

If the OOM reaper cannot reap the OOM victim's memory for more than
a second (MAX_OOM_REAP_RETRIES retries with HZ/10 jiffies sleep),
we assume that the OOM reaper won't be able to make forward progress
and allow the OOM killer to select a new victim.

But clearing TIF_MEMDIE does not help make forward progress, because
the OOM killer will select the same victim again. Therefore, this patch
excludes the thread group which the TIF_MEMDIE thread belongs to by
marking it as no longer suitable as an OOM victim. This way, we can
handle the "TIF_MEMDIE thread can fall into unkillable waits inside the
direct reclaim paths while allocating memory with mmap_sem held for
write" problem.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |  2 ++
 mm/oom_kill.c         | 20 ++++++++++----------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fa8a74d..fc44910 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -787,6 +787,8 @@ struct signal_struct {
 	 * oom
 	 */
 	bool oom_flag_origin;
+	/* Already OOM-killed but cannot terminate. Don't count on me. */
+	bool oom_skip_me;
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 35158b7..0421986 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -167,7 +167,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	long points;
 	long adj;
 
-	if (oom_unkillable_task(p, memcg, nodemask))
+	if (oom_unkillable_task(p, memcg, nodemask) || p->signal->oom_skip_me)
 		return 0;
 
 	p = find_lock_task_mm(p);
@@ -179,8 +179,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * unkillable or have been already oom reaped.
 	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+	if (adj == OOM_SCORE_ADJ_MIN) {
 		task_unlock(p);
 		return 0;
 	}
@@ -276,7 +275,8 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
-	if (oom_unkillable_task(task, NULL, oc->nodemask))
+	if (oom_unkillable_task(task, NULL, oc->nodemask) ||
+	    task->signal->oom_skip_me)
 		return OOM_SCAN_CONTINUE;
 
 	/*
@@ -469,7 +469,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		return true;
 
 	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
+	if (test_bit(MMF_OOM_REAPED, &mm->flags) ||
+	    !atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
 		return true;
 	}
@@ -547,12 +548,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	}
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
+	 * Tell the OOM killer to ignore this thread group, for the memory
+	 * used by this thread group is already reaped or cannot be reaped
+	 * due to mmap_sem contention.
 	 */
-	exit_oom_victim(tsk);
+	tsk->signal->oom_skip_me = true;
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
1.8.3.1


end of thread, other threads:[~2016-04-15 12:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-14 10:56 [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Tetsuo Handa
2016-04-14 10:56 ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory Tetsuo Handa
2016-04-14 11:31   ` Michal Hocko
2016-04-14 15:03     ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory Tetsuo Handa
2016-04-14 15:18       ` Michal Hocko
2016-04-14 21:59         ` [PATCH] mm,oom: Clarify reason to kill other threads sharing the vitctim's memory Tetsuo Handa
2016-04-14 11:21 ` [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test Michal Hocko
2016-04-14 11:34   ` Tetsuo Handa
2016-04-14 12:01     ` Michal Hocko
2016-04-14 12:34       ` Michal Hocko
2016-04-14 14:01         ` Tetsuo Handa
2016-04-14 14:30           ` Michal Hocko
2016-04-15 12:11   ` Tetsuo Handa
