* [patch] mm, oom: prevent additional oom kills before memory is freed
From: David Rientjes @ 2017-06-14 23:43 UTC
  To: Andrew Morton; +Cc: Michal Hocko, Tetsuo Handa, linux-mm, linux-kernel

If mm->mm_users is not incremented because it is already zero by the oom
reaper, meaning the final refcount has been dropped, do not set
MMF_OOM_SKIP prematurely.

__mmput() may not have had a chance to do exit_mmap() yet, so memory from
a previous oom victim is still mapped.  __mmput() naturally requires no
references on mm->mm_users to do exit_mmap().
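
For context, the path in question in __oom_reap_task_mm() looks roughly like
this (an abridged sketch of mm/oom_kill.c at the time, not verbatim source):

	if (!down_read_trylock(&mm->mmap_sem)) {
		ret = false;
		goto unlock_oom;
	}

	/*
	 * mm->mm_users has already dropped to zero here: __mmput() is in
	 * flight, but exit_mmap() may not have run yet, so the victim's
	 * memory can still be fully mapped when the caller goes on to set
	 * MMF_OOM_SKIP.
	 */
	if (!mmget_not_zero(mm)) {
		up_read(&mm->mmap_sem);
		goto unlock_oom;	/* still reported as success */
	}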

Without this, several processes can be oom killed unnecessarily and the
oom log can show an abundance of memory available if exit_mmap() is in
progress at the time the process is skipped.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -531,6 +531,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 					 NULL);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -562,7 +563,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
-
+	/*
+	 * Hide this mm from OOM killer because it cannot be reaped since
+	 * mm->mmap_sem cannot be acquired.
+	 */
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();
@@ -570,12 +575,6 @@ static void oom_reap_task(struct task_struct *tsk)
 done:
 	tsk->oom_reaper_list = NULL;
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
-
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko @ 2017-06-15 10:39 UTC
  To: David Rientjes; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Wed 14-06-17 16:43:03, David Rientjes wrote:
> If mm->mm_users is not incremented because it is already zero by the oom
> reaper, meaning the final refcount has been dropped, do not set
> MMF_OOM_SKIP prematurely.
> 
> __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> a previous oom victim is still mapped.

true and do we have a _guarantee_ it will do it? E.g. can somebody block
exit_aio from completing? Or can somebody hold mmap_sem and thus block
ksm_exit resp. khugepaged_exit from completing? The reason why I was
conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
give a definitive answer to those questions. And we really _want_ to
have a guarantee of forward progress here. Killing an additional
process is a price to pay, and if that doesn't trigger normally it sounds
like a reasonable compromise to me.
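
For reference, __mmput() at that time ran a sequence approximately like the
following (paraphrased from kernel/fork.c, details elided); every step before
exit_mmap() is an opportunity to block:

	static inline void __mmput(struct mm_struct *mm)
	{
		VM_BUG_ON(atomic_read(&mm->mm_users));

		uprobe_clear_state(mm);
		exit_aio(mm);		/* may wait for outstanding aio */
		ksm_exit(mm);		/* may need mmap_sem */
		khugepaged_exit(mm);	/* may wait for khugepaged */
		exit_mmap(mm);		/* this is what actually frees memory */
		/* ... */
		set_bit(MMF_OOM_SKIP, &mm->flags);
		mmdrop(mm);
	}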

> __mmput() naturally requires no
> references on mm->mm_users to do exit_mmap().
> 
> Without this, several processes can be oom killed unnecessarily and the
> oom log can show an abundance of memory available if exit_mmap() is in
> progress at the time the process is skipped.

Have you seen this happening in real life?

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -531,6 +531,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  					 NULL);
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
> @@ -562,7 +563,11 @@ static void oom_reap_task(struct task_struct *tsk)
>  	if (attempts <= MAX_OOM_REAP_RETRIES)
>  		goto done;
>  
> -
> +	/*
> +	 * Hide this mm from OOM killer because it cannot be reaped since
> +	 * mm->mmap_sem cannot be acquired.
> +	 */
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		task_pid_nr(tsk), tsk->comm);
>  	debug_show_all_locks();
> @@ -570,12 +575,6 @@ static void oom_reap_task(struct task_struct *tsk)
>  done:
>  	tsk->oom_reaper_list = NULL;
>  
> -	/*
> -	 * Hide this mm from OOM killer because it has been either reaped or
> -	 * somebody can't call up_write(mmap_sem).
> -	 */
> -	set_bit(MMF_OOM_SKIP, &mm->flags);
> -
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
>  }

-- 
Michal Hocko
SUSE Labs

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Tetsuo Handa @ 2017-06-15 10:53 UTC
  To: mhocko, rientjes; +Cc: akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > If mm->mm_users is not incremented because it is already zero by the oom
> > reaper, meaning the final refcount has been dropped, do not set
> > MMF_OOM_SKIP prematurely.
> > 
> > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > a previous oom victim is still mapped.
> 
> true and do we have a _guarantee_ it will do it? E.g. can somebody block
> exit_aio from completing? Or can somebody hold mmap_sem and thus block
> ksm_exit resp. khugepaged_exit from completing? The reason why I was
> conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> give a definitive answer to those questions. And we really _want_ to
> > have a guarantee of forward progress here. Killing an additional
> > process is a price to pay, and if that doesn't trigger normally it sounds
> like a reasonable compromise to me.

Right. If you want this patch, __oom_reap_task_mm() must not return true without
setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
guarantee that the OOM killer is re-enabled within finite time, for __mmput()
cannot guarantee that MMF_OOM_SKIP is set within finite time.

> 
> > __mmput() naturally requires no
> > references on mm->mm_users to do exit_mmap().
> > 
> > Without this, several processes can be oom killed unnecessarily and the
> > oom log can show an abundance of memory available if exit_mmap() is in
> > progress at the time the process is skipped.
> 
> Have you seen this happening in real life?
> 
> > Signed-off-by: David Rientjes <rientjes@google.com>

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko @ 2017-06-15 11:01 UTC
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Thu 15-06-17 19:53:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > > If mm->mm_users is not incremented because it is already zero by the oom
> > > reaper, meaning the final refcount has been dropped, do not set
> > > MMF_OOM_SKIP prematurely.
> > > 
> > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > a previous oom victim is still mapped.
> > 
> > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > give a definitive answer to those questions. And we really _want_ to
> > have a guarantee of forward progress here. Killing an additional
> > process is a price to pay, and if that doesn't trigger normally it sounds
> > like a reasonable compromise to me.
> 
> Right. If you want this patch, __oom_reap_task_mm() must not return true without
> setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
> does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
> guarantee that the OOM killer is re-enabled within finite time, for __mmput()
> cannot guarantee that MMF_OOM_SKIP is set within finite time.

An alternative would be to allow reaping and exit_mmap to race. The unmap
part should just work I guess. We just have to be careful to not race
with free_pgtables and that shouldn't be too hard to implement (e.g.
(ab)use mmap_sem for write there). I haven't thought that through
completely though so I might miss something of course.
-- 
Michal Hocko
SUSE Labs

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Tetsuo Handa @ 2017-06-15 11:32 UTC
  To: mhocko; +Cc: rientjes, akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 15-06-17 19:53:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > > > If mm->mm_users is not incremented because it is already zero by the oom
> > > > reaper, meaning the final refcount has been dropped, do not set
> > > > MMF_OOM_SKIP prematurely.
> > > > 
> > > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > > a previous oom victim is still mapped.
> > > 
> > > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > > give a definitive answer to those questions. And we really _want_ to
> > > have a guarantee of forward progress here. Killing an additional
> > > process is a price to pay, and if that doesn't trigger normally it sounds
> > > like a reasonable compromise to me.
> > 
> > Right. If you want this patch, __oom_reap_task_mm() must not return true without
> > setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
> > does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
> > guarantee that the OOM killer is re-enabled within finite time, for __mmput()
> > cannot guarantee that MMF_OOM_SKIP is set within finite time.
> 
> > An alternative would be to allow reaping and exit_mmap to race. The unmap
> part should just work I guess. We just have to be careful to not race
> with free_pgtables and that shouldn't be too hard to implement (e.g.
> (ab)use mmap_sem for write there). I haven't thought that through
> completely though so I might miss something of course.

I think the one below is simpler.

 mm/oom_kill.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..c63f514 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -466,11 +466,10 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
@@ -488,10 +487,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	mutex_lock(&oom_lock);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
+	if (!down_read_trylock(&mm->mmap_sem))
 		goto unlock_oom;
-	}
 
 	/*
 	 * increase mm_users only after we know we will reap something so
@@ -531,6 +528,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 					 NULL);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -546,7 +544,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
-	return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
+	       && attempts++ < MAX_OOM_REAP_RETRIES)
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
-
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
+
+	tsk->oom_reaper_list = NULL;
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
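
A note on the rewritten loop: the comma operator means __oom_reap_task_mm()
is called every time the loop condition is evaluated, including the first,
and only the MMF_OOM_SKIP test and the retry counter decide whether to keep
going. An equivalent, more explicit form would be (a sketch, not part of the
posted patch):

	for (;;) {
		__oom_reap_task_mm(tsk, mm);
		if (test_bit(MMF_OOM_SKIP, &mm->flags))
			break;
		if (attempts++ >= MAX_OOM_REAP_RETRIES)
			break;
		schedule_timeout_idle(HZ/10);
	}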

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko @ 2017-06-15 12:03 UTC
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > An alternative would be to allow reaping and exit_mmap to race. The unmap
> > part should just work I guess. We just have to be careful to not race
> > with free_pgtables and that shouldn't be too hard to implement (e.g.
> > (ab)use mmap_sem for write there). I haven't thought that through
> > completely though so I might miss something of course.
> 
> I think the one below is simpler.
[...]
> @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
>  	struct mm_struct *mm = tsk->signal->oom_mm;
>  
>  	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> +	while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> +	       && attempts++ < MAX_OOM_REAP_RETRIES)
>  		schedule_timeout_idle(HZ/10);
>  
> -	if (attempts <= MAX_OOM_REAP_RETRIES)
> -		goto done;
> -
> -
> -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -		task_pid_nr(tsk), tsk->comm);
> -	debug_show_all_locks();
> -
> -done:
> -	tsk->oom_reaper_list = NULL;
> -
>  	/*
>  	 * Hide this mm from OOM killer because it has been either reaped or
>  	 * somebody can't call up_write(mmap_sem).
>  	 */
> -	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +			task_pid_nr(tsk), tsk->comm);
> +		debug_show_all_locks();
> +	}
> +

How does this _solve_ anything? Why would you even retry when you
_know_ that the reference count dropped to zero. It will never
increment. So the above is basically just schedule_timeout_idle(HZ/10) *
MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP. This might be enough
for the victim to finish exit_mmap but it is more a hack^Wworkaround
than anything else. You could very well do the sleep without any
obfuscation...
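
(For concreteness: MAX_OOM_REAP_RETRIES is 10 and each
schedule_timeout_idle(HZ/10) sleep lasts 100ms, so that works out to roughly
one second of waiting before MMF_OOM_SKIP is set.)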
-- 
Michal Hocko
SUSE Labs

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko @ 2017-06-15 12:13 UTC
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > An alternative would be to allow reaping and exit_mmap to race. The unmap
> > > part should just work I guess. We just have to be careful to not race
> > > with free_pgtables and that shouldn't be too hard to implement (e.g.
> > > (ab)use mmap_sem for write there). I haven't thought that through
> > > completely though so I might miss something of course.
> > 
> > I think the one below is simpler.
> [...]
> > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> >  	struct mm_struct *mm = tsk->signal->oom_mm;
> >  
> >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > +	while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > +	       && attempts++ < MAX_OOM_REAP_RETRIES)
> >  		schedule_timeout_idle(HZ/10);
> >  
> > -	if (attempts <= MAX_OOM_REAP_RETRIES)
> > -		goto done;
> > -
> > -
> > -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > -		task_pid_nr(tsk), tsk->comm);
> > -	debug_show_all_locks();
> > -
> > -done:
> > -	tsk->oom_reaper_list = NULL;
> > -
> >  	/*
> >  	 * Hide this mm from OOM killer because it has been either reaped or
> >  	 * somebody can't call up_write(mmap_sem).
> >  	 */
> > -	set_bit(MMF_OOM_SKIP, &mm->flags);
> > +	if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > +			task_pid_nr(tsk), tsk->comm);
> > +		debug_show_all_locks();
> > +	}
> > +
> 
> How does this _solve_ anything? Why would you even retry when you
> _know_ that the reference count dropped to zero. It will never
> increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.

Just to make myself more clear: the above assumes that the victim hasn't
passed exit_mmap and MMF_OOM_SKIP in __mmput, which is the case we want to
address here.
-- 
Michal Hocko
SUSE Labs

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko @ 2017-06-15 12:20 UTC
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Thu 15-06-17 13:01:19, Michal Hocko wrote:
> On Thu 15-06-17 19:53:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > > > If mm->mm_users is not incremented because it is already zero by the oom
> > > > reaper, meaning the final refcount has been dropped, do not set
> > > > MMF_OOM_SKIP prematurely.
> > > > 
> > > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > > a previous oom victim is still mapped.
> > > 
> > > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > > give a definitive answer to those questions. And we really _want_ to
> > > have a guarantee of forward progress here. Killing an additional
> > > process is a price to pay, and if that doesn't trigger normally it sounds
> > > like a reasonable compromise to me.
> > 
> > Right. If you want this patch, __oom_reap_task_mm() must not return true without
> > setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
> > does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
> > guarantee that the OOM killer is re-enabled within finite time, for __mmput()
> > cannot guarantee that MMF_OOM_SKIP is set within finite time.
> 
> An alternative would be to allow reaping and exit_mmap to race. The unmap
> part should just work I guess. We just have to be careful to not race
> with free_pgtables and that shouldn't be too hard to implement (e.g.
> (ab)use mmap_sem for write there). I haven't thought that through
> completely though so I might miss something of course.

Just to illustrate what I had in mind: this is completely untested
(not even compile tested) and it may be completely broken, but let's try
to evaluate this approach.
---
diff --git a/mm/mmap.c b/mm/mmap.c
index 3bd5ecd20d4d..761ba1dc9ec6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,7 +2962,13 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	up_write(&mm->mmap_sem);
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925e7826..3df464f0f48b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -494,16 +494,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
-
-	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
 	 * should imply barriers already and the reader would hit a page fault
@@ -537,13 +527,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
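
The intent of the sketch: once unmap_vmas() has released the bulk of the
memory, the reaper no longer needs to pin mm_users at all, and taking
mmap_sem for write around free_pgtables() in exit_mmap() means the reaper's
down_read_trylock() simply fails while page tables are being freed. Whether
that closes every window is exactly the open question noted above.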
-- 
Michal Hocko
SUSE Labs

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Tetsuo Handa @ 2017-06-15 13:01 UTC
  To: mhocko; +Cc: rientjes, akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> > On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> > >  	struct mm_struct *mm = tsk->signal->oom_mm;
> > >  
> > >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > > -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > > +	while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > > +	       && attempts++ < MAX_OOM_REAP_RETRIES)
> > >  		schedule_timeout_idle(HZ/10);
> > >  
> > > -	if (attempts <= MAX_OOM_REAP_RETRIES)
> > > -		goto done;
> > > -
> > > -
> > > -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > -		task_pid_nr(tsk), tsk->comm);
> > > -	debug_show_all_locks();
> > > -
> > > -done:
> > > -	tsk->oom_reaper_list = NULL;
> > > -
> > >  	/*
> > >  	 * Hide this mm from OOM killer because it has been either reaped or
> > >  	 * somebody can't call up_write(mmap_sem).
> > >  	 */
> > > -	set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +	if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > > +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > +			task_pid_nr(tsk), tsk->comm);
> > > +		debug_show_all_locks();
> > > +	}
> > > +
> > 
> > How does this _solve_ anything? Why would you even retry when you
> > _know_ that the reference count dropped to zero. It will never
> > increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> > MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.

If the OOM reaper knows that mm->mm_users == 0, it gives __mmput() some time
to "complete exit_mmap() etc. and set MMF_OOM_SKIP". If __mmput() released
some memory, subsequent OOM killer invocation is automatically avoided.
If __mmput() did not release some memory, let the OOM killer be invoked again.

> 
> Just to make myself more clear: the above assumes that the victim hasn't
> passed exit_mmap and MMF_OOM_SKIP in __mmput, which is the case we want to
> address here.

David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper finds that
mm->mm_users == 0. But we must not wait forever because __mmput() might fail to
release some memory immediately. If __mmput() did not release some memory within
the schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM killer
be invoked again. So, this is the case we want to address here, isn't it?

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko @ 2017-06-15 13:22 UTC
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Thu 15-06-17 22:01:33, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> > > On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > > > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> > > >  	struct mm_struct *mm = tsk->signal->oom_mm;
> > > >  
> > > >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > > > -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > > > +	while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > > > +	       && attempts++ < MAX_OOM_REAP_RETRIES)
> > > >  		schedule_timeout_idle(HZ/10);
> > > >  
> > > > -	if (attempts <= MAX_OOM_REAP_RETRIES)
> > > > -		goto done;
> > > > -
> > > > -
> > > > -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > -		task_pid_nr(tsk), tsk->comm);
> > > > -	debug_show_all_locks();
> > > > -
> > > > -done:
> > > > -	tsk->oom_reaper_list = NULL;
> > > > -
> > > >  	/*
> > > >  	 * Hide this mm from OOM killer because it has been either reaped or
> > > >  	 * somebody can't call up_write(mmap_sem).
> > > >  	 */
> > > > -	set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > +	if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > > > +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > +			task_pid_nr(tsk), tsk->comm);
> > > > +		debug_show_all_locks();
> > > > +	}
> > > > +
> > > 
> > > How does this _solve_ anything? Why would you even retry when you
> > > _know_ that the reference count dropped to zero. It will never
> > > increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> > > MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.
> 
> If the OOM reaper knows that mm->mm_users == 0, it gives __mmput() some time
> to "complete exit_mmap() etc. and set MMF_OOM_SKIP". If __mmput() released
> some memory, subsequent OOM killer invocation is automatically avoided.
> If __mmput() did not release some memory, let the OOM killer be invoked again.
> 
> > 
> > Just to make myself more clear: the above assumes that the victim hasn't
> > passed exit_mmap and MMF_OOM_SKIP in __mmput, which is the case we want to
> > address here.
> 
> David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper finds that
> mm->mm_users == 0. But we must not wait forever because __mmput() might fail to
> release some memory immediately. If __mmput() did not release some memory within
> the schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM killer
> be invoked again. So, this is the case we want to address here, isn't it?

And we are back to a timeout based approach... Sigh. Just imagine that
you have a really large process which will take some time to tear down,
while the memory it frees might be in a different oom domain. Now you
propose to keep retrying and eventually give up, allowing a new oom
victim to be selected from that oom domain.

If we want to handle oom victims with mm_users == 0 then let's do it
properly, please.
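
(For context, "properly" here means serializing the reaper with the exit path
instead of sleeping and retrying. A rough, hypothetical sketch of that idea, not
the patch from the link mentioned elsewhere in this thread; mm_is_oom_victim()
stands in for whatever predicate identifies a victim mm:)

----------
        /* hypothetical: make the exit path synchronize with the reaper */
        if (mm_is_oom_victim(mm)) {
                down_write(&mm->mmap_sem);      /* waits out a reap in flight */
                set_bit(MMF_OOM_SKIP, &mm->flags);
                up_write(&mm->mmap_sem);
        }
        /* ... free_pgtables() etc. follow ... */
----------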

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 10:39   ` Michal Hocko
@ 2017-06-15 21:26     ` David Rientjes
  -1 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2017-06-15 21:26 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Thu, 15 Jun 2017, Michal Hocko wrote:

> > If mm->mm_users is not incremented because it is already zero by the oom
> > reaper, meaning the final refcount has been dropped, do not set
> > MMF_OOM_SKIP prematurely.
> > 
> > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > a previous oom victim is still mapped.
> 
> true and do we have a _guarantee_ it will do it? E.g. can somebody block
> exit_aio from completing? Or can somebody hold mmap_sem and thus block
> ksm_exit resp. khugepaged_exit from completing? The reason why I was
> conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> give a definitive answer to those questions. And we really _want_ to
> have a guarantee of forward progress here. Killing an additional
> process is a price to pay and if that doesn't trigger normally it sounds
> like a reasonable compromise to me.
> 

I have not seen any issues where __mmput() stalls and exit_mmap() fails to 
free its mapped memory once mm->mm_users has dropped to 0.

> > __mmput() naturally requires no
> > references on mm->mm_users to do exit_mmap().
> > 
> > Without this, several processes can be oom killed unnecessarily and the
> > oom log can show an abundance of memory available if exit_mmap() is in
> > progress at the time the process is skipped.
> 
> Have you seen this happening in the real life?
> 

Yes, quite a bit in testing.

One oom kill shows the system to be oom:

[22999.488705] Node 0 Normal free:90484kB min:90500kB ...
[22999.488711] Node 1 Normal free:91536kB min:91948kB ...

followed up by one or more unnecessary oom kills showing the oom killer 
racing with memory freeing of the victim:

[22999.510329] Node 0 Normal free:229588kB min:90500kB ...
[22999.510334] Node 1 Normal free:600036kB min:91948kB ...

The patch is absolutely required for us to prevent continuous oom killing 
of processes after a single process has been oom killed and its memory is 
in the process of being freed.
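
(For readers following the free: vs min: numbers, the oom decision boils down to
a zone watermark comparison. A simplified sketch of the test, not the full
__zone_watermark_ok() logic, which also accounts for order and reserves;
enough_free() is an illustrative name:)

----------
        /*
         * Node 0 Normal above: free 90484kB <= min 90500kB, so the zone
         * counts as exhausted; once the victim's memory is freed,
         * 229588kB > 90500kB and any further kill is unnecessary.
         */
        static bool enough_free(unsigned long free_kb, unsigned long min_kb)
        {
                return free_kb > min_kb;
        }
----------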

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 13:01               ` Tetsuo Handa
@ 2017-06-15 21:37                 ` David Rientjes
  -1 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2017-06-15 21:37 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, linux-kernel

On Thu, 15 Jun 2017, Tetsuo Handa wrote:

> David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper finds that
> mm->mm_users == 0.

Yes, because MMF_OOM_SKIP enables the oom killer to select another process 
to kill and will do so without the original victim's mm being able to 
undergo exit_mmap().  So now we kill two or more processes when one would 
have sufficed; I have seen up to four processes killed unnecessarily 
without this patch.

> But we must not wait forever, because __mmput() might fail to
> release some memory immediately. If __mmput() did not release some memory within
> a schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM killer
> be invoked again. So this is the case we want to address here, isn't it?
> 

How long it is sensible to wait for __mmput() to even get a chance to be 
called is obviously a function of the number of threads that share the mm 
with the oom victim, along with potentially allowing a non-zero number of 
those threads to allocate from memory reserves so that they can eventually 
drop mm->mmap_sem and make forward progress.

I have not witnessed any thread stalling in __mmput() that prevents the 
mm's memory from being freed.  I have witnessed several processes oom killed 
unnecessarily for a single oom condition where, before MMF_OOM_SKIP was 
introduced, a single oom kill would have sufficed.
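
(As background for the "memory reserves" point: an oom victim's threads carry
TIF_MEMDIE, which the allocator honors by lifting watermark checks so they can
finish exiting. A trimmed sketch of that rule from the allocator slow path of
this era, details elided:)

----------
        /* dying threads may dip into memory reserves to finish exiting */
        if (unlikely(test_thread_flag(TIF_MEMDIE)))
                alloc_flags |= ALLOC_NO_WATERMARKS;
----------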

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 21:26     ` David Rientjes
@ 2017-06-15 21:41       ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-15 21:41 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Thu 15-06-17 14:26:26, David Rientjes wrote:
> On Thu, 15 Jun 2017, Michal Hocko wrote:
> 
> > > If mm->mm_users is not incremented because it is already zero by the oom
> > > reaper, meaning the final refcount has been dropped, do not set
> > > MMF_OOM_SKIP prematurely.
> > > 
> > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > a previous oom victim is still mapped.
> > 
> > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > give a definitive answer to those questions. And we really _want_ to
> > have a guarantee of forward progress here. Killing an additional
> > process is a price to pay and if that doesn't trigger normally it sounds
> > like a reasonable compromise to me.
> > 
> 
> I have not seen any issues where __mmput() stalls and exit_mmap() fails to 
> free its mapped memory once mm->mm_users has dropped to 0.
> 
> > > __mmput() naturally requires no
> > > references on mm->mm_users to do exit_mmap().
> > > 
> > > Without this, several processes can be oom killed unnecessarily and the
> > > oom log can show an abundance of memory available if exit_mmap() is in
> > > progress at the time the process is skipped.
> > 
> > Have you seen this happening in the real life?
> > 
> 
> Yes, quite a bit in testing.
> 
> One oom kill shows the system to be oom:
> 
> [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> 
> followed up by one or more unnecessary oom kills showing the oom killer 
> racing with memory freeing of the victim:
> 
> [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> 
> The patch is absolutely required for us to prevent continuous oom killing 
> of processes after a single process has been oom killed and its memory is 
> in the process of being freed.

OK, could you play with the patch/idea suggested in
http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 13:22                 ` Michal Hocko
@ 2017-06-15 21:43                   ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-15 21:43 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 15-06-17 22:01:33, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> > > > On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > > > > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> > > > >  	struct mm_struct *mm = tsk->signal->oom_mm;
> > > > >  
> > > > >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > > > > -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > > > > +	while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > > > > +	       && attempts++ < MAX_OOM_REAP_RETRIES)
> > > > >  		schedule_timeout_idle(HZ/10);
> > > > >  
> > > > > -	if (attempts <= MAX_OOM_REAP_RETRIES)
> > > > > -		goto done;
> > > > > -
> > > > > -
> > > > > -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > > -		task_pid_nr(tsk), tsk->comm);
> > > > > -	debug_show_all_locks();
> > > > > -
> > > > > -done:
> > > > > -	tsk->oom_reaper_list = NULL;
> > > > > -
> > > > >  	/*
> > > > >  	 * Hide this mm from OOM killer because it has been either reaped or
> > > > >  	 * somebody can't call up_write(mmap_sem).
> > > > >  	 */
> > > > > -	set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > > +	if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > > > > +		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > > +			task_pid_nr(tsk), tsk->comm);
> > > > > +		debug_show_all_locks();
> > > > > +	}
> > > > > +
> > > > 
> > > > How does this _solve_ anything? Why would you even retry when you
> > > > _know_ that the reference count dropped to zero. It will never
> > > > increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> > > > MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.
> > 
> > If the OOM reaper knows that mm->mm_users == 0, it gives __mmput() some time
> > to "complete exit_mmap() etc. and set MMF_OOM_SKIP". If __mmput() released
> > some memory, subsequent OOM killer invocation is automatically avoided.
> > If __mmput() did not release some memory, let the OOM killer be invoked again.
> > 
> > > 
> > > Just to make myself more clear. The above assumes that the victim hasn't
> > > passed exit_mmap and MMF_OOM_SKIP in __mmput. Which is the case we want to
> > > address here.
> > 
> > David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper finds that
> > mm->mm_users == 0. But we must not wait forever, because __mmput() might fail to
> > release some memory immediately. If __mmput() did not release some memory within
> > a schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM killer
> > be invoked again. So this is the case we want to address here, isn't it?
> 
> And we are back with a timeout based approach... Sigh. Just imagine that
> you have a really large process which will take some time to tear down.
> While it frees memory that might be in a different oom domain. Now you
> pretend to keep retrying and eventually give up to allow a new oom
> victim from that oom domain.

We are already using a timeout based approach at the down_read_trylock(&mm->mmap_sem)
in __oom_reap_task_mm(). It is possible that down_read_trylock(&mm->mmap_sem)
would succeed if the OOM reaper waited one more second, since the thread
holding mmap_sem for write could merely be failing to get TIF_MEMDIE due to
oom_lock contention among unrelated threads, yet we allow the OOM reaper to
give up after one second.
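
(A minimal sketch of that existing trylock timeout, simplified from
__oom_reap_task_mm(); reap_attempt() is an illustrative name:)

----------
        static bool reap_attempt(struct mm_struct *mm)
        {
                /*
                 * The reaper never sleeps on mmap_sem: while any writer
                 * holds it, each attempt fails and the caller sleeps
                 * HZ/10 and retries, up to MAX_OOM_REAP_RETRIES times.
                 */
                if (!down_read_trylock(&mm->mmap_sem))
                        return false;
                /* ... unmap the reapable anonymous ranges ... */
                up_read(&mm->mmap_sem);
                return true;
        }
----------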

Even if the victim is a really large process which will take some time to tear
down """inside __mmput()""", a subsequent OOM killer invocation will be _automatically
avoided_ if __mmput() released _some_ memory. Thus, giving up on __mmput() after one
second, just as we give up on down_read_trylock(&mm->mmap_sem) after one second, is
reasonable.

> 
> If we want to handle oom victims with mm_users == 0 then let's do it
> properly, please.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 21:41       ` Michal Hocko
@ 2017-06-15 22:03         ` David Rientjes
  -1 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2017-06-15 22:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Thu, 15 Jun 2017, Michal Hocko wrote:

> > Yes, quite a bit in testing.
> > 
> > One oom kill shows the system to be oom:
> > 
> > [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> > [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> > 
> > followed up by one or more unnecessary oom kills showing the oom killer 
> > racing with memory freeing of the victim:
> > 
> > [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> > [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> > 
> > The patch is absolutely required for us to prevent continuous oom killing 
> > of processes after a single process has been oom killed and its memory is 
> > in the process of being freed.
> 
> OK, could you play with the patch/idea suggested in
> http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?
> 

I cannot.  I am trying to unblock a stable kernel release to my production 
that is obviously fixed with this patch, and cannot experiment with 
uncompiled and untested patches that introduce otherwise unnecessary 
locking into the __mmput() path and are based on speculation rather than 
hard data that __mmput() for some reason stalls for the oom victim's mm.  
I was hoping that this fix could make it in time for 4.12, since 4.12 kills 
1-4 processes unnecessarily for each oom condition; we can then review any 
tested solution you may propose at a later time.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 22:03         ` David Rientjes
@ 2017-06-15 22:12           ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-15 22:12 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Thu 15-06-17 15:03:17, David Rientjes wrote:
> On Thu, 15 Jun 2017, Michal Hocko wrote:
> 
> > > Yes, quite a bit in testing.
> > > 
> > > One oom kill shows the system to be oom:
> > > 
> > > [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> > > [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> > > 
> > > followed up by one or more unnecessary oom kills showing the oom killer 
> > > racing with memory freeing of the victim:
> > > 
> > > [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> > > [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> > > 
> > > The patch is absolutely required for us to prevent continuous oom killing 
> > > of processes after a single process has been oom killed and its memory is 
> > > in the process of being freed.
> > 
> > OK, could you play with the patch/idea suggested in
> > http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?
> > 
> 
> I cannot.  I am trying to unblock a stable kernel release to my production 
> that is obviously fixed with this patch, and cannot experiment with 
> uncompiled and untested patches that introduce otherwise unnecessary 
> locking into the __mmput() path and are based on speculation rather than 
> hard data that __mmput() for some reason stalls for the oom victim's mm.  
> I was hoping that this fix could make it in time for 4.12, since 4.12 kills 
> 1-4 processes unnecessarily for each oom condition; we can then review any 
> tested solution you may propose at a later time.

I am sorry, but I have worked really hard to make the oom reaper a reliable
way to make all the potential oom lockups go away. I do not want to
reintroduce another potential lockup now. I also do not see why any
solution should be rushed in. I have proposed a way to go, and unless
it is clear that this is not a way forward I simply do not agree
with any partial workarounds or shortcuts.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 22:12           ` Michal Hocko
@ 2017-06-15 22:42             ` David Rientjes
  -1 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2017-06-15 22:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Fri, 16 Jun 2017, Michal Hocko wrote:

> I am sorry, but I have worked really hard to make the oom reaper a reliable
> way to make all the potential oom lockups go away. I do not want to
> reintroduce another potential lockup now.

Please show where this "potential lockup" ever existed in a bug report or 
a testcase?  I have never seen __mmput() block when trying to free the 
memory it maps.

> I also do not see why any
> solution should be rushed into. I have proposed a way to go and unless
> it is clear that this is not a way forward then I simply do not agree
> with any partial workarounds or shortcuts.

This is not a shortcut, it is a bug fix.  4.12 kills 1-4 processes 
unnecessarily as a result of setting MMF_OOM_SKIP incorrectly before the 
mm's memory can be freed.  If you have not seen this issue before, which 
is why you asked if I ever observed it in practice, then you have not 
stress tested oom reaping.  It is very observable and reproducible.  I do 
not agree that adding additional and obscure locking into __mmput() is the 
solution to what is plainly and obviously fixed with this simple patch.

4.12 needs to stop killing 2-5 processes on every oom condition instead of 
1.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 22:12           ` Michal Hocko
@ 2017-06-16  0:54             ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-16  0:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 15-06-17 15:03:17, David Rientjes wrote:
> > On Thu, 15 Jun 2017, Michal Hocko wrote:
> > 
> > > > Yes, quite a bit in testing.
> > > > 
> > > > One oom kill shows the system to be oom:
> > > > 
> > > > [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> > > > [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> > > > 
> > > > followed up by one or more unnecessary oom kills showing the oom killer 
> > > > racing with memory freeing of the victim:
> > > > 
> > > > [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> > > > [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> > > > 
> > > > The patch is absolutely required for us to prevent continuous oom killing 
> > > > of processes after a single process has been oom killed and its memory is 
> > > > in the process of being freed.
> > > 
> > > OK, could you play with the patch/idea suggested in
> > > http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?
> > > 
> > 
> > I cannot.  I am trying to unblock a stable kernel release to my production 
> > that is obviously fixed with this patch, and cannot experiment with 
> > uncompiled and untested patches that introduce otherwise unnecessary 
> > locking into the __mmput() path and are based on speculation rather than 
> > hard data that __mmput() for some reason stalls for the oom victim's mm.  
> > I was hoping that this fix could make it in time for 4.12, since 4.12 kills 
> > 1-4 processes unnecessarily for each oom condition; we can then review any 
> > tested solution you may propose at a later time.
> 
> I am sorry, but I have worked really hard to make the oom reaper a reliable
> way to make all the potential oom lockups go away. I do not want to
> reintroduce another potential lockup now. I also do not see why any
> solution should be rushed in. I have proposed a way to go, and unless
> it is clear that this is not a way forward I simply do not agree
> with any partial workarounds or shortcuts.

And the patch you proposed is broken.

----------
[  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
[  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  161.858503] ------------[ cut here ]------------
[  161.861512] kernel BUG at mm/memory.c:1381!
[  161.864154] invalid opcode: 0000 [#1] SMP
[  161.866599] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_balloon pcspkr ppdev shpchp parport_pc i2c_piix4 parport vmw_vmci xfs libcrc32c vmwgfx crc32c_intel drm_kms_helper serio_raw ttm drm e1000 mptspi scsi_transport_spi mptscsih mptbase ata_generic pata_acpi floppy
[  161.896811] CPU: 1 PID: 43 Comm: oom_reaper Not tainted 4.12.0-rc5+ #221
[  161.900458] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  161.905588] task: ffff937bb1c13200 task.stack: ffffa13cc0b94000
[  161.908876] RIP: 0010:unmap_page_range+0xa19/0xa60
[  161.911739] RSP: 0000:ffffa13cc0b97d08 EFLAGS: 00010282
[  161.914767] RAX: 0000000000000000 RBX: ffff937ba9e89300 RCX: 0000000000401000
[  161.918543] RDX: ffff937baf707440 RSI: ffff937baf707680 RDI: ffffa13cc0b97df0
[  161.922314] RBP: ffffa13cc0b97de0 R08: 0000000000000000 R09: 0000000000000000
[  161.926059] R10: 0000000000000000 R11: 000000001f1e8b15 R12: ffff937ba9e893c0
[  161.929789] R13: ffff937ba4198000 R14: ffff937baf707440 R15: ffff937ba9e89300
[  161.933509] FS:  0000000000000000(0000) GS:ffff937bb3800000(0000) knlGS:0000000000000000
[  161.937615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.940774] CR2: 0000561fb93c1b00 CR3: 000000009ee11000 CR4: 00000000001406e0
[  161.944477] Call Trace:
[  161.946333]  ? __mutex_lock+0x574/0x950
[  161.948678]  ? __mutex_lock+0xce/0x950
[  161.950996]  ? __oom_reap_task_mm+0x49/0x170
[  161.953485]  __oom_reap_task_mm+0xd8/0x170
[  161.955893]  oom_reaper+0xac/0x1c0
[  161.957992]  ? remove_wait_queue+0x60/0x60
[  161.960688]  kthread+0x117/0x150
[  161.962719]  ? trace_event_raw_event_oom_score_adj_update+0xe0/0xe0
[  161.965920]  ? kthread_create_on_node+0x70/0x70
[  161.968417]  ret_from_fork+0x2a/0x40
[  161.970530] Code: 13 fb ff ff e9 25 fc ff ff 48 83 e8 01 e9 77 fc ff ff 48 83 e8 01 e9 62 fe ff ff e8 22 0a e6 ff 48 8b 7d 98 e8 09 ba ff ff 0f 0b <0f> 0b 48 83 e9 01 e9 a1 fb ff ff e8 03 a5 06 00 48 83 e9 01 e9 
[  161.979386] RIP: unmap_page_range+0xa19/0xa60 RSP: ffffa13cc0b97d08
[  161.982611] ---[ end trace ef2b349884b0aaa4 ]---
----------

Please carefully consider the reason why there is a VM_BUG_ON() in __mmput(),
clarify in your patch what the possible side effects of racing
uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
__oom_reap_task_mm() are, and clarify in your patch that there is no possibility
of waiting for direct/indirect memory allocation inside free_pgtables(),
in addition to fixing the bug above.

----------
	VM_BUG_ON(atomic_read(&mm->mm_users));

	uprobe_clear_state(mm);
	exit_aio(mm);		/* waits for all in-flight kiocbs */
	ksm_exit(mm);		/* may take mmap_sem for write */
	khugepaged_exit(mm);	/* must run before exit_mmap */
	exit_mmap(mm);
----------

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16  0:54             ` Tetsuo Handa
@ 2017-06-16  4:00               ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-16  4:00 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel

Tetsuo Handa wrote:
> and clarify in your patch that there is no possibility
> of waiting for direct/indirect memory allocation inside free_pgtables(),
> in addition to fixing the bug above.

Oops, this part was wrong, for __oom_reap_task_mm() will give up after
waiting for one second, because down_read_trylock(&mm->mmap_sem) keeps
failing due to the down_write(&mm->mmap_sem) in exit_mmap().
# This is after all moving the location of "give up by timeout", isn't it? ;-)

Thus, clarify in your patch that there is no possibility of waiting for
direct/indirect memory allocation outside the down_write()/up_write() section
(e.g. the i_mmap_lock_write() inside unmap_vmas(&tlb, vma, 0, -1) just before
down_write()).
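
(A rough sketch of the ordering in the proposed patch that the above refers to,
as described in this thread, with details elided:)

----------
        /* unmap_vmas() runs before mmap_sem is taken for write, so the
         * i_mmap_lock_write() it may need is outside the locked section */
        unmap_vmas(&tlb, vma, 0, -1);
        down_write(&mm->mmap_sem);      /* from here the reaper's trylock fails */
        free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
        up_write(&mm->mmap_sem);
----------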

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 22:42             ` David Rientjes
@ 2017-06-16  8:06               ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-16  8:06 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-kernel

On Thu 15-06-17 15:42:23, David Rientjes wrote:
> On Fri, 16 Jun 2017, Michal Hocko wrote:
> 
> > I am sorry, but I have worked really hard to make the oom reaper a reliable
> > way to make all the potential oom lockups go away. I do not want to
> > reintroduce another potential lockup now.
> 
> Please show where this "potential lockup" ever existed in a bug report or 
> a testcase?

I am not aware of any specific bug report. But the main point of the
reaper is to close all _possible_ lockups due to an oom victim being stuck
somewhere. exit_aio waits for all kiocbs. Can we guarantee that none
of them will depend on an allocation (directly or via a lock chain) to
proceed? Likewise ksm_exit/khugepaged_exit depend on mmap_sem for write
to proceed. Are we _guaranteed_ nobody can hold mmap_sem for read at
that time and depend on an allocation? Can we guarantee that the __mmput
path will work without any dependency on allocation in the future?
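
(To make the mmap_sem dependency concrete: both ksm_exit() and __khugepaged_exit()
can end up doing a write-lock/unlock pair to serialize against their scanners.
A trimmed sketch of that pattern:)

----------
        /*
         * Waits for every mmap_sem reader to drain; if such a reader is
         * itself blocked on an allocation, __mmput() stalls right here.
         */
        down_write(&mm->mmap_sem);
        up_write(&mm->mmap_sem);
----------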

> I have never seen __mmput() block when trying to free the 
> memory it maps.
> 
> > I also do not see why any
> > solution should be rushed into. I have proposed a way to go and unless
> > it is clear that this is not a way forward then I simply do not agree
> > with any partial workarounds or shortcuts.
> 
> This is not a shortcut, it is a bug fix.  4.12 kills 1-4 processes 
> unnecessarily as a result of setting MMF_OOM_SKIP incorrectly before the 
> mm's memory can be freed.  If you have not seen this issue before, which 
> is why you asked if I ever observed it in practice, then you have not 
> stress tested oom reaping.  It is very observable and reproducible.  

I am not questioning that it works for your particular test. I just
argue that it reduces the robustness of the oom reaper, because it allows
an oom victim to leave the reaper without MMF_OOM_SKIP set, and that flag
is the core mechanism guaranteeing forward progress. So we should think
about something more appropriate.
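
(The "core concept" being referred to: victim selection refuses to pick a new
victim while an existing one is still exiting, unless that victim's mm is marked
MMF_OOM_SKIP. Roughly, as in oom_evaluate_task() of this era:)

----------
        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        goto next;      /* reaped or given up: may kill again */
                goto abort;             /* still exiting: no additional kill */
        }
----------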

> I do 
> not agree that adding additional and obscure locking into __mmput() is the 
> solution to what is plainly and obviously fixed with this simple patch.

Well, the __mmput path already depends on mmap_sem for write. So this is
not a new concept. I am not saying using mmap_sem is the only way. I
will think about that more.
 
> 4.12 needs to stop killing 2-5 processes on every oom condition instead of 
> 1.

Believe me, I am not dismissing the issue nor the fact it _has_ to be
fixed. I just disagree we should make the oom reaper less robust.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16  0:54             ` Tetsuo Handa
@ 2017-06-16  8:39               ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-16  8:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel

On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
[...]
> And the patch you proposed is broken.

Thanks for your testing!
 
> ----------
> [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [  161.858503] ------------[ cut here ]------------
> [  161.861512] kernel BUG at mm/memory.c:1381!

BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see what is
going on here.
__oom_reap_task_mm				exit_mmap
						  free_pgtables
						  up_write(mm->mmap_sem)
  down_read_trylock(&mm->mmap_sem)
  						  remove_vma
    unmap_page_range

So we need to extend the mmap_sem coverage. See the updated diff (not
the full proper patch yet).

> Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> and clarify in your patch what the possible side effects of racing
> uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> __oom_reap_task_mm() are

Yes, that definitely needs to be checked. We basically rely on the racing
part of __mmput not modifying the address space. oom_reaper doesn't
touch any vma state except that it unmaps pages, which can run in
parallel. exit_aio->kill_ioctx seemingly does vm_munmap but a) it uses
the mmap_sem for write and b) it doesn't actually unmap because exit_aio
does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some housekeeping
which does not modify the address space. I hope I will find some more
time to work on this next week. Additional testing would be highly
appreciated of course.
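
The exit_aio() detail mentioned above, roughly (simplified from the
fs/aio.c of that era; argument list abbreviated):

	/*
	 * exit_aio() zeroes mmap_size before killing each ioctx, so the
	 * kill_ioctx() path skips the vm_munmap() of the aio ring.
	 */
	ctx->mmap_size = 0;
	kill_ioctx(mm, ctx, ...);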

---
diff --git a/mm/mmap.c b/mm/mmap.c
index 3bd5ecd20d4d..ca58f8a2a217 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -2975,6 +2980,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+	up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925e7826..3df464f0f48b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -494,16 +494,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
-
-	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
 	 * should imply barriers already and the reader would hit a page fault
@@ -537,13 +527,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16  8:39               ` Michal Hocko
@ 2017-06-16 10:27                 ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-16 10:27 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> [...]
> > And the patch you proposed is broken.
> 
> Thanks for your testing!
>  
> > ----------
> > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  161.858503] ------------[ cut here ]------------
> > [  161.861512] kernel BUG at mm/memory.c:1381!
> 
> BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see what is
> going on here.
> __oom_reap_task_mm				exit_mmap
> 						  free_pgtables
> 						  up_write(mm->mmap_sem)
>   down_read_trylock(&mm->mmap_sem)
>   						  remove_vma
>     unmap_page_range
> 
> So we need to extend the mmap_sem coverage. See the updated diff (not
> the full proper patch yet).

That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
unmap_page_range() when __mmput() has already called exit_mmap(), by
setting/checking MMF_OOM_SKIP as shown below.

diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..5ef715c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,6 +902,11 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	exit_mmap(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
@@ -913,6 +918,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	up_write(&mm->mmap_sem);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..98cca19 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,12 +493,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		goto unlock_oom;
 	}
@@ -537,13 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;



> 
> > Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> > and clarify in your patch what the possible side effects of racing
> > uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> > __oom_reap_task_mm() are
> 
> Yes, that definitely needs to be checked. We basically rely on the racing
> part of __mmput not modifying the address space. oom_reaper doesn't
> touch any vma state except that it unmaps pages, which can run in
> parallel. exit_aio->kill_ioctx seemingly does vm_munmap but a) it uses
> the mmap_sem for write and b) it doesn't actually unmap because exit_aio
> does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some housekeeping
> which does not modify the address space. I hope I will find some more
> time to work on this next week. Additional testing would be highly
> appreciated of course.

Since the OOM reaper does not reap hugepages, the khugepaged_exit() part could
be safe. But the ksm_exit() part might interfere. If it is guaranteed to be
safe, what would go wrong if we moved uprobe_clear_state()/exit_aio()/ksm_exit()
etc. to just before mmdrop() (i.e. after setting MMF_OOM_SKIP)?
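
Roughly, the reordering being asked about would look like this (purely
hypothetical and untested, just to make the question concrete):

	static inline void __mmput(struct mm_struct *mm)
	{
		khugepaged_exit(mm);	/* must still run before exit_mmap */
		exit_mmap(mm);		/* frees the bulk of the memory */
		...
		set_bit(MMF_OOM_SKIP, &mm->flags);
		/* moved after the flag, just before the final drop: */
		uprobe_clear_state(mm);
		exit_aio(mm);
		ksm_exit(mm);
		mmdrop(mm);
	}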

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16 10:27                 ` Tetsuo Handa
@ 2017-06-16 11:02                   ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-16 11:02 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Fri 16-06-17 19:27:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> > [...]
> > > And the patch you proposed is broken.
> > 
> > Thanks for your testing!
> >  
> > > ----------
> > > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > [  161.858503] ------------[ cut here ]------------
> > > [  161.861512] kernel BUG at mm/memory.c:1381!
> > 
> > BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see what is
> > going on here.
> > __oom_reap_task_mm				exit_mmap
> > 						  free_pgtables
> > 						  up_write(mm->mmap_sem)
> >   down_read_trylock(&mm->mmap_sem)
> >   						  remove_vma
> >     unmap_page_range
> > 
> > So we need to extend the mmap_sem coverage. See the updated diff (not
> > the full proper patch yet).
> 
> That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
> unmap_page_range() when __mmput() has already called exit_mmap(), by
> setting/checking MMF_OOM_SKIP as shown below.

Care to explain why?
[...]
 
> Since the OOM reaper does not reap hugepages, the khugepaged_exit() part could
> be safe.

I think you are mixing up hugetlb and THP pages here. khugepaged_exit is
about the latter and we do unmap those.

> But the ksm_exit() part might interfere.

How?

> If it is guaranteed to be safe,
> what would go wrong if we moved uprobe_clear_state()/exit_aio()/ksm_exit() etc.
> to just before mmdrop() (i.e. after setting MMF_OOM_SKIP)?

I do not see why those matter and why they should be special in any way.
Unless I am missing something, we really only care about the page table
teardown and modifications of the address space. They do none of that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-15 21:41       ` Michal Hocko
@ 2017-06-16 12:22         ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-16 12:22 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> OK, could you play with the patch/idea suggested in
> http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?

I think we don't need to worry about the mmap_sem dependency inside __mmput().
Since the OOM killer checks for an !MMF_OOM_SKIP mm rather than a TIF_MEMDIE
thread, we can keep the OOM killer disabled until we set MMF_OOM_SKIP on the
victim's mm. That is, elevating mm_users throughout the reaping procedure does
not cause premature victim selection, even after TIF_MEMDIE is cleared from the
victim's thread. Then, we don't need to use down_write()/up_write() for a
non-OOM victim's mm (nearly 100% of exit_mmap() calls), and can force partial
reaping of an OOM victim's mm (nearly 0% of exit_mmap() calls) before __mmput()
starts doing exit_aio() etc. The patch is shown below. Only compile tested.

 include/linux/sched/coredump.h |  1 +
 mm/oom_kill.c                  | 80 ++++++++++++++++++++----------------------
 2 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 98ae0d0..6b6237b 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
  * on NFS restore
  */
 //#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
+#define MMF_OOM_REAPING		18	/* mm is supposed to be reaped */
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..bdcf658 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,38 +470,9 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
-
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -537,16 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -573,9 +535,32 @@ static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 *
+	 * Serialize setting of MMF_OOM_SKIP using oom_lock in order to
+	 * avoid race with select_bad_process() which causes premature
+	 * new oom victim selection.
+	 *
+	 * The OOM reaper:           An allocating task:
+	 *                             Failed get_page_from_freelist().
+	 *                             Enters into out_of_memory().
+	 *   Reaped memory enough to make get_page_from_freelist() succeed.
+	 *   Sets MMF_OOM_SKIP to mm.
+	 *                               Enters into select_bad_process().
+	 *                                 # MMF_OOM_SKIP mm selects new victim.
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
+	/*
+	 * Drop our reference but make sure the mmput slow path is called from a
+	 * different context because we shouldn't risk we get stuck there and
+	 * put the oom_reaper out of the way.
+	 */
+	if (test_bit(MMF_OOM_REAPING, &mm->flags)) {
+		clear_bit(MMF_OOM_REAPING, &mm->flags);
+		mmput_async(mm);
+	}
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -658,6 +643,13 @@ static void mark_oom_victim(struct task_struct *tsk)
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		mmgrab(tsk->signal->oom_mm);
 
+#ifdef CONFIG_MMU
+	if (!test_bit(MMF_OOM_REAPING, &mm->flags)) {
+		set_bit(MMF_OOM_REAPING, &mm->flags);
+		mmget(mm);
+	}
+#endif
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -913,6 +905,12 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
+#ifdef CONFIG_MMU
+			if (test_bit(MMF_OOM_REAPING, &mm->flags)) {
+				clear_bit(MMF_OOM_REAPING, &mm->flags);
+				mmput_async(mm);
+			}
+#endif
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16 12:22         ` Tetsuo Handa
@ 2017-06-16 14:12           ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-16 14:12 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Fri 16-06-17 21:22:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, could you play with the patch/idea suggested in
> > http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?
> 
> I think we don't need to worry about the mmap_sem dependency inside __mmput().
> Since the OOM killer checks for an !MMF_OOM_SKIP mm rather than a TIF_MEMDIE
> thread, we can keep the OOM killer disabled until we set MMF_OOM_SKIP on the
> victim's mm. That is, elevating mm_users throughout the reaping procedure does
> not cause premature victim selection, even after TIF_MEMDIE is cleared from the
> victim's thread. Then, we don't need to use down_write()/up_write() for a
> non-OOM victim's mm (nearly 100% of exit_mmap() calls), and can force partial
> reaping of an OOM victim's mm (nearly 0% of exit_mmap() calls) before __mmput()
> starts doing exit_aio() etc. The patch is shown below. Only compile tested.

Yes, that would be another approach.
 
>  include/linux/sched/coredump.h |  1 +
>  mm/oom_kill.c                  | 80 ++++++++++++++++++++----------------------
>  2 files changed, 40 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 98ae0d0..6b6237b 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>   * on NFS restore
>   */
>  //#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> +#define MMF_OOM_REAPING		18	/* mm is supposed to be reaped */

A new flag is not really needed. We can increase mm_users for _each_
reapable oom victim.

> @@ -658,6 +643,13 @@ static void mark_oom_victim(struct task_struct *tsk)
>  	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
>  		mmgrab(tsk->signal->oom_mm);
>  
> +#ifdef CONFIG_MMU
> +	if (!test_bit(MMF_OOM_REAPING, &mm->flags)) {
> +		set_bit(MMF_OOM_REAPING, &mm->flags);
> +		mmget(mm);
> +	}
> +#endif

This would really need a big fat warning explaining why we do not need
mmget_not_zero. We rely on exit_mm doing both mmput and tsk->mm = NULL
under the task_lock, mark_oom_victim is called under this lock as well,
and task_will_free_mem resp. find_lock_task_mm makes sure we do not even
consider tasks without an mm.
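
Roughly, the pairing being relied on (a conceptual sketch only, not the
exact kernel code):

	/*
	 * exit path (exit_mm)           oom killer (oom_kill_process)
	 *
	 * task_lock(tsk);               p = find_lock_task_mm(victim);
	 * tsk->mm = NULL;                 (returns with task_lock held,
	 * task_unlock(tsk);                and only if victim->mm != NULL)
	 * mmput(mm);                    mark_oom_victim(victim);
	 *                               task_unlock(victim);
	 *
	 * mark_oom_victim() only runs while task_lock() pins a non-NULL
	 * victim->mm, and the task's own mm_users reference is dropped
	 * only after tsk->mm has been cleared under that lock, so a plain
	 * mmget() cannot race with the count already being zero.
	 */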

I agree that a solution which is fully contained inside the oom proper
would be preferable to touching the __mmput path.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16 11:02                   ` Michal Hocko
@ 2017-06-16 14:26                     ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-16 14:26 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 16-06-17 19:27:19, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> > > [...]
> > > > And the patch you proposed is broken.
> > > 
> > > Thanks for your testing!
> > >  
> > > > ----------
> > > > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > > > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > [  161.858503] ------------[ cut here ]------------
> > > > [  161.861512] kernel BUG at mm/memory.c:1381!
> > > 
> > > BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see what is
> > > going on here.
> > > __oom_reap_task_mm				exit_mmap
> > > 						  free_pgtables
> > > 						  up_write(mm->mmap_sem)
> > >   down_read_trylock(&mm->mmap_sem)
> > >   						  remove_vma
> > >     unmap_page_range
> > > 
> > > So we need to extend the mmap_sem coverage. See the updated diff (not
> > > the full proper patch yet).
> > 
> > That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
> > unmap_page_range() when __mmput() has already called exit_mmap(), by
> > setting/checking MMF_OOM_SKIP as shown below.
> 
> Care to explain why?

I don't know. Your updated diff causes the oops below.

----------
[   90.621890] Out of memory: Kill process 2671 (a.out) score 999 or sacrifice child
[   90.624636] Killed process 2671 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   90.861308] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   90.863695] Modules linked in: coretemp pcspkr sg vmw_vmci shpchp i2c_piix4 sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi mptscsih ahci mptbase libahci drm e1000 ata_piix i2c_core libata ipv6
[   90.870672] CPU: 2 PID: 47 Comm: oom_reaper Not tainted 4.12.0-rc5+ #128
[   90.872929] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   90.875995] task: ffff88007b6cd2c0 task.stack: ffff88007b6d0000
[   90.878290] RIP: 0010:__oom_reap_task_mm+0xa1/0x160
[   90.880242] RSP: 0018:ffff88007b6d3df0 EFLAGS: 00010202
[   90.882240] RAX: 6b6b6b6b6b6b6b6b RBX: ffff880077b8cd40 RCX: 0000000000000000
[   90.884612] RDX: ffff88007b6d3e18 RSI: ffff880077b8cd40 RDI: ffff88007b6d3df0
[   90.887001] RBP: ffff88007b6d3e98 R08: ffff88007b6cdb08 R09: ffff88007b6cdad0
[   90.889702] R10: 0000000000000000 R11: 000000009213dd65 R12: ffff880077b8ce00
[   90.892973] R13: ffff880076f48040 R14: 6b6b6b6b6b6b6b6b R15: ffff880077b8cd40
[   90.895765] FS:  0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
[   90.899015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   90.901462] CR2: 00007feeae35ac80 CR3: 0000000076e21000 CR4: 00000000001406e0
[   90.904019] Call Trace:
[   90.905518]  ? process_timeout+0x1/0x10
[   90.907280]  oom_reaper+0xa2/0x1b0
[   90.908946]  ? wake_up_bit+0x30/0x30
[   90.911391]  kthread+0x10d/0x140
[   90.913003]  ? __oom_reap_task_mm+0x160/0x160
[   90.914936]  ? kthread_create_on_node+0x60/0x60
[   90.916733]  ret_from_fork+0x27/0x40
[   90.918307] Code: c3 e8 54 82 f1 ff f0 80 8b 7a 04 00 00 40 48 8d bd 58 ff ff ff 48 83 c9 ff 31 d2 48 89 de e8 57 12 03 00 4c 8b 33 4d 85 f6 74 3b <49> 8b 46 50 a9 00 24 40 00 75 27 49 83 be 90 00 00 00 00 74 04 
[   90.923922] RIP: __oom_reap_task_mm+0xa1/0x160 RSP: ffff88007b6d3df0
[   90.929583] ---[ end trace 20f6ec27ed25c461 ]---
----------

It is you who should explain why. I found my patch via trial and error.

> [...]
>  
> > Since the OOM reaper does not reap hugepages, the khugepaged_exit() part
> > could be safe.
> 
> I think you are mixing up hugetlb and THP pages here. khugepaged_exit is
> about the latter and we do unmap those.

OK.

> 
> > But the ksm_exit() part might interfere.
> 
> How?

Why do you think it does not interfere?
Please explain it in your patch description, because your patch is
trying to do a tricky thing. I'm not a MM person. I just remain
suspicious of the things which you think are no problem.

> 
> > If it is guaranteed to be safe,
> > what would go wrong if we moved uprobe_clear_state()/exit_aio()/ksm_exit() etc.
> > to just before mmdrop() (i.e. after setting MMF_OOM_SKIP)?
> 
> I do not see why those matter and why they should be special in any way.
> Unless I am missing something, we really only care about the page table
> teardown and modifications of the address space. They do none of that.

I think the patch I posted at
http://lkml.kernel.org/r/201706162122.ACE95321.tOFLOOVFFHMSJQ@I-love.SAKURA.ne.jp
will be safer, and you agree that a solution which is fully contained inside
the oom proper would be preferable. Thus, let's start checking that patch.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16 14:26                     ` Tetsuo Handa
@ 2017-06-16 14:42                       ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-16 14:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Fri 16-06-17 23:26:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 16-06-17 19:27:19, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> > > > [...]
> > > > > And the patch you proposed is broken.
> > > > 
> > > > Thanks for your testing!
> > > >  
> > > > > ----------
> > > > > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > > > > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  161.858503] ------------[ cut here ]------------
> > > > > [  161.861512] kernel BUG at mm/memory.c:1381!
> > > > 
> > > > BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see what is
> > > > going on here.
> > > > __oom_reap_task_mm				exit_mmap
> > > > 						  free_pgtables
> > > > 						  up_write(mm->mmap_sem)
> > > >   down_read_trylock(&mm->mmap_sem)
> > > >   						  remove_vma
> > > >     unmap_page_range
> > > > 
> > > > So we need to extend the mmap_sem coverage. See the updated diff (not
> > > > the full proper patch yet).
> > > 
> > > That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
> > > unmap_page_range() when __mmput() has already called exit_mmap(), by
> > > setting/checking MMF_OOM_SKIP as shown below.
> > 
> > Care to explain why?
> 
> I don't know. Your updated diff causes the oops below.
> 
> ----------
> [   90.621890] Out of memory: Kill process 2671 (a.out) score 999 or sacrifice child
> [   90.624636] Killed process 2671 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> [   90.861308] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [   90.863695] Modules linked in: coretemp pcspkr sg vmw_vmci shpchp i2c_piix4 sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi mptscsih ahci mptbase libahci drm e1000 ata_piix i2c_core libata ipv6
> [   90.870672] CPU: 2 PID: 47 Comm: oom_reaper Not tainted 4.12.0-rc5+ #128
> [   90.872929] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   90.875995] task: ffff88007b6cd2c0 task.stack: ffff88007b6d0000
> [   90.878290] RIP: 0010:__oom_reap_task_mm+0xa1/0x160

What does this disassemble to on your kernel? Care to post the addr2line output?
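
(For example, assuming a vmlinux with debug info, something along the
lines of

	$ gdb vmlinux
	(gdb) list *(__oom_reap_task_mm+0xa1)

would pinpoint the faulting source line.)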

[...]
 
> It is you who should explain why.

I can definitely try but it was really impossible to deduce that you
have seen an oops from your previous email...

> I found my patch via trial and error.
> 
> > [...]
> >  
> > > Since the OOM reaper does not reap hugepages, the khugepaged_exit() part
> > > could be safe.
> > 
> > I think you are mixing up hugetlb and THP pages here. khugepaged_exit is
> > about the latter and we do unmap those.
> 
> OK.
> 
> > 
> > > But the ksm_exit() part might interfere.
> > 
> > How?
> 
> Why do you think it does not interfere?

Because it doesn't modify the address space in any way.

> Please explain it in your patch description, because your patch is
> trying to do a tricky thing. I'm not a MM person. I just remain
> suspicious of the things which you think are no problem.

yeah, poking holes into a patch is a reasonable approach, but if you make
a statement like "the ksm_exit() part might interfere" then you should
back it up with an argument.
 
> > > If it is guaranteed to be safe,
> > > what would go wrong if we moved uprobe_clear_state()/exit_aio()/ksm_exit() etc.
> > > to just before mmdrop() (i.e. after setting MMF_OOM_SKIP)?
> > 
> > I do not see why those matter and why they should be special in any way.
> > Unless I am missing something, we really only care about the page table
> > teardown and modifications of the address space. They do none of that.
> 
> I think the patch I posted at
> http://lkml.kernel.org/r/201706162122.ACE95321.tOFLOOVFFHMSJQ@I-love.SAKURA.ne.jp
> will be safer, and you agree that a solution which is fully contained inside
> the oom proper would be preferable. So let's start by reviewing that patch.

Yes, I will keep thinking about your approach some more, but it indeed
seems easier and less tricky.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-16 14:12           ` Michal Hocko
@ 2017-06-17  5:17             ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-17  5:17 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 16-06-17 21:22:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, could you play with the patch/idea suggested in
> > > http://lkml.kernel.org/r/20170615122031.GL1486@dhcp22.suse.cz?
> > 
> > I think we don't need to worry about mmap_sem dependency inside __mmput().
> > Since the OOM killer checks for !MMF_OOM_SKIP mm rather than TIF_MEMDIE thread,
> > we can keep the OOM killer disabled until we set MMF_OOM_SKIP to the victim's mm.
> > That is, elevating mm_users throughout the reaping procedure does not cause
> > premature victim selection, even after TIF_MEMDIE is cleared from the victim's
> > thread. Then, we don't need to use down_write()/up_write() for a non-OOM victim's mm
> > (nearly 100% of exit_mmap() calls), and can force partial reaping of OOM victim's mm
> > (nearly 0% of exit_mmap() calls) before __mmput() starts doing exit_aio() etc.
> > Patch is shown below. Only compile tested.
> 
> Yes, that would be another approach.
>  
> >  include/linux/sched/coredump.h |  1 +
> >  mm/oom_kill.c                  | 80 ++++++++++++++++++++----------------------
> >  2 files changed, 40 insertions(+), 41 deletions(-)
> > 
> > diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> > index 98ae0d0..6b6237b 100644
> > --- a/include/linux/sched/coredump.h
> > +++ b/include/linux/sched/coredump.h
> > @@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
> >   * on NFS restore
> >   */
> >  //#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> > +#define MMF_OOM_REAPING		18	/* mm is supposed to be reaped */
> 
> A new flag is not really needed. We can increase it for _each_ reapable
> oom victim.

Yes, if based on the assumption that the number of mark_oom_victim() calls
matches the number of wake_oom_reaper() calls...

> 
> > @@ -658,6 +643,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> >  	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
> >  		mmgrab(tsk->signal->oom_mm);
> >  
> > +#ifdef CONFIG_MMU
> > +	if (!test_bit(MMF_OOM_REAPING, &mm->flags)) {
> > +		set_bit(MMF_OOM_REAPING, &mm->flags);
> > +		mmget(mm);
> > +	}
> > +#endif
> 
> This would really need a big fat warning explaining why we do not need
> mmget_not_zero. We rely on exit_mm doing both mmput and tsk->mm = NULL
> under the task_lock and mark_oom_victim is called under this lock as
> well and task_will_free_mem resp. find_lock_task_mm makes sure we do not
> even consider tasks without an mm.
> 
> I agree that a solution which is fully contained inside the oom proper
> would be preferable to touching __mmput path.
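
(Restating that argument as an interleaving, for the record; this is a
sketch, not verbatim kernel code:)

----------
/*
 *   exit path (exit_mm)          OOM killer (mark_oom_victim)
 *   -------------------          ----------------------------
 *                                task_lock(tsk);  [held by all callers]
 *                                mm = tsk->mm;    [observed non-NULL]
 *   task_lock(tsk);              mmget(mm);
 *     [blocks until the OOM      task_unlock(tsk);
 *      side drops the lock]
 *   tsk->mm = NULL;
 *   task_unlock(tsk);
 *   mmput(mm);       [cannot have run before mmget(), so mm_users
 *                     was still >= 1 at that point]
 */
----------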

OK. Updated patch shown below.

----------------------------------------
>From 5ed8922bd281456793408328c8b27899ebdd298b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 17 Jun 2017 14:04:09 +0900
Subject: [PATCH] mm,oom_kill: Close race window of needlessly selecting new
 victims.

David Rientjes has reported that the OOM killer can select the next OOM
victim when existing OOM victims have called __mmput() before the OOM
reaper starts trying to unmap pages. In his testing, 4.12-rc kernels are
killing 1-4 processes unnecessarily for each OOM condition.

----------
  One oom kill shows the system to be oom:

  [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
  [22999.488711] Node 1 Normal free:91536kB min:91948kB ...

  followed up by one or more unnecessary oom kills showing the oom killer
  racing with memory freeing of the victim:

  [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
  [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
----------

This is because commit e5e3f4c4f0e95ecb ("mm, oom_reaper: make sure that
mmput_async is called only when memory was reaped") took care not to set
the MMF_OOM_REAPED flag when the OOM reaper found that mm_users == 0, but
commit 26db62f179d112d3 ("oom: keep mm of the killed task available") then
mistakenly changed this to always set the MMF_OOM_REAPED flag. As a result,
the MMF_OOM_SKIP flag is set immediately, without waiting for __mmput()
(which might get stuck before setting the MMF_OOM_SKIP flag), which led to
the report above.

A workaround is to let the OOM reaper wait for a while and give up via a
timeout, as if the OOM reaper were unable to take mmap_sem for read. But
we want to avoid a timeout-based approach if possible. Therefore, this
patch takes a different approach.

This patch elevates mm_users of an OOM victim's mm and prevents the OOM
victim from calling __mmput() before the OOM reaper starts trying to unmap
pages. In this way, we can force the OOM reaper to try to reclaim some
memory before the MMF_OOM_SKIP flag is set.

Since commit 862e3073b3eed13f ("mm, oom: get rid of
signal_struct::oom_victims") changed the code to keep the OOM killer
disabled until MMF_OOM_SKIP is set on the victim's mm (rather than until
TIF_MEMDIE is cleared from the victim's thread), we can keep the OOM
killer disabled until __oom_reap_task_mm() or __mmput() reclaims some
memory and sets the MMF_OOM_SKIP flag. That is, elevating mm_users
throughout the reaping procedure does not cause premature victim selection.
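
(The resulting reference flow, sketched; this is my reading of what the
diff below implements, assuming CONFIG_MMU:)

----------
  OOM killer                      OOM reaper
  ----------                      ----------
  mark_oom_victim(tsk)
    mmget(mm)        [mm_users pinned, so exit_mm()'s
                      mmput() cannot trigger __mmput()]
  wake_oom_reaper(tsk)
                                  __oom_reap_task_mm(tsk, mm)
                                    [unmaps under mmap_sem]
                                  mmput_async(mm)
                                    [may now run __mmput() and
                                     thus exit_mmap()]
                                  set_bit(MMF_OOM_SKIP, &mm->flags)
                                    [only now can a new victim
                                     be selected]
----------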

This patch also reduces the range of oom_lock protection in
__oom_reap_task_mm() introduced by commit e2fe14564d3316d1 ("oom_reaper:
close race with exiting task"). Since the OOM killer is kept disabled until
the MMF_OOM_SKIP flag is set, we don't need to serialize the whole reaping
procedure; serializing only the setting of the MMF_OOM_SKIP flag is enough.
This allows the OOM reaper to start reclaiming memory as soon as
wake_oom_reaper() is called by the OOM killer, which compensates for the
delayed exit_mmap() call caused by elevating mm_users of the OOM victim's
mm when the OOM victim could otherwise exit smoothly, for currently the
OOM killer calls schedule_timeout_killable(1) in out_of_memory() with
oom_lock held. This might also allow direct OOM reaping (i.e. letting the
OOM killer call __oom_reap_task_mm() itself, since we now have a 16KB
kernel stack with stack overflow detection) and replacing the OOM reaper
kernel thread with a workqueue (i.e. saving one kernel thread).

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 26db62f179d112d3 ("oom: keep mm of the killed task available")
Cc: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 94 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 51 insertions(+), 43 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..cf1d331 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,38 +470,9 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -537,16 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -569,12 +531,31 @@ static void oom_reap_task(struct task_struct *tsk)
 
 done:
 	tsk->oom_reaper_list = NULL;
+	/*
+	 * Drop a mm_users reference taken by mark_oom_victim().
+	 * A mm_count reference taken by mark_oom_victim() remains.
+	 */
+	mmput_async(mm);
 
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 *
+	 * Serialize setting of MMF_OOM_SKIP using oom_lock in order to
+	 * avoid race with select_bad_process() which causes premature
+	 * new oom victim selection.
+	 *
+	 * The OOM reaper:           An allocating task:
+	 *                             Failed get_page_from_freelist().
+	 *                             Enters into out_of_memory().
+	 *   Reaped memory enough to make get_page_from_freelist() succeed.
+	 *   Sets MMF_OOM_SKIP to mm.
+	 *                               Enters into select_bad_process().
+	 *                                 # MMF_OOM_SKIP mm selects new victim.
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -602,12 +583,16 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
+	if (!oom_reaper_th) {
+		mmput_async(tsk->signal->oom_mm);
 		return;
+	}
 
 	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (tsk == oom_reaper_list || tsk->oom_reaper_list) {
+		mmput_async(tsk->signal->oom_mm);
 		return;
+	}
 
 	get_task_struct(tsk);
 
@@ -650,12 +635,32 @@ static void mark_oom_victim(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->mm;
 
 	WARN_ON(oom_killer_disabled);
+#ifdef CONFIG_MMU
+	/*
+	 * Take a mm_users reference so that __oom_reap_task_mm() can unmap
+	 * pages without risking a race condition where final mmput() from
+	 * exit_mm() from do_exit() triggered __mmput() and gets stuck there
+	 * (but __oom_reap_task_mm() cannot unmap pages due to mm_users == 0).
+	 *
+	 * Since all callers guarantee that this mm is stable (hold task_lock
+	 * or task is current), we can safely use mmget() here.
+	 *
+	 * When dropping this reference, mmput_async() has to be used because
+	 * __mmput() can get stuck which in turn keeps the OOM killer/reaper
+	 * disabled forever.
+	 */
+	mmget(mm);
+#endif
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		/*
+		 * Take a mm_count reference so that we can examine flags value
+		 * when tsk_is_oom_victim() is true.
+		 */
 		mmgrab(tsk->signal->oom_mm);
 
 	/*
@@ -908,6 +913,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
+#ifdef CONFIG_MMU
+			mmput_async(mm);
+#endif
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-16 14:42                       ` Michal Hocko
@ 2017-06-17 13:30                         ` Tetsuo Handa
  -1 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-17 13:30 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 16-06-17 23:26:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 16-06-17 19:27:19, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > And the patch you proposed is broken.
> > > > > 
> > > > > Thanks for your testing!
> > > > >  
> > > > > > ----------
> > > > > > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > > > > > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > > > [  161.858503] ------------[ cut here ]------------
> > > > > > [  161.861512] kernel BUG at mm/memory.c:1381!
> > > > > 
> > > > > BUG_ON(addr >= end) suggests our vma has trimmed. I guess I see what is
> > > > > going on here.
> > > > > __oom_reap_task_mm				exit_mmap
> > > > > 						  free_pgtables
> > > > > 						  up_write(mm->mmap_sem)
> > > > >   down_read_trylock(&mm->mmap_sem)
> > > > >   						  remove_vma
> > > > >     unmap_page_range
> > > > > 
> > > > > So we need to extend the mmap_sem coverage. See the updated diff (not
> > > > > the full proper patch yet).
> > > > 
> > > > That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
> > > > unmap_page_range() when __mmput() already called exit_mm(), by setting/checking
> > > > MMF_OOM_SKIP like shown below.
> > > 
> > > Care to explain why?
> > 
> > I don't know. Your updated diff is causing below oops.
> > 
> > ----------
> > [   90.621890] Out of memory: Kill process 2671 (a.out) score 999 or sacrifice child
> > [   90.624636] Killed process 2671 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [   90.861308] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [   90.863695] Modules linked in: coretemp pcspkr sg vmw_vmci shpchp i2c_piix4 sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi mptscsih ahci mptbase libahci drm e1000 ata_piix i2c_core libata ipv6
> > [   90.870672] CPU: 2 PID: 47 Comm: oom_reaper Not tainted 4.12.0-rc5+ #128
> > [   90.872929] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> > [   90.875995] task: ffff88007b6cd2c0 task.stack: ffff88007b6d0000
> > [   90.878290] RIP: 0010:__oom_reap_task_mm+0xa1/0x160
> 
> What does this dissassemble to on your kernel? Care to post addr2line?

----------
[  114.427451] Out of memory: Kill process 2876 (a.out) score 999 or sacrifice child
[  114.430208] Killed process 2876 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  114.436753] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  114.439129] Modules linked in: pcspkr coretemp sg vmw_vmci i2c_piix4 shpchp sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci e1000 libahci mptspi scsi_transport_spi drm mptscsih mptbase i2c_core ata_piix libata ipv6
[  114.446220] CPU: 0 PID: 47 Comm: oom_reaper Not tainted 4.12.0-rc5+ #133
[  114.448705] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  114.451695] task: ffff88007b6cd2c0 task.stack: ffff88007b6d0000
[  114.453703] RIP: 0010:__oom_reap_task_mm+0xa1/0x160
[  114.455422] RSP: 0000:ffff88007b6d3df0 EFLAGS: 00010202
[  114.457527] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800670eaa40 RCX: 0000000000000000
[  114.460002] RDX: ffff88007b6d3e18 RSI: ffff8800670eaa40 RDI: ffff88007b6d3df0
[  114.462206] RBP: ffff88007b6d3e98 R08: ffff88007b6cdb08 R09: ffff88007b6cdad0
[  114.464390] R10: 0000000000000000 R11: 0000000083f54a84 R12: ffff8800670eab00
[  114.466659] R13: ffff880067211bc0 R14: 6b6b6b6b6b6b6b6b R15: ffff8800670eaa40
[  114.469126] FS:  0000000000000000(0000) GS:ffff88007c200000(0000) knlGS:0000000000000000
[  114.471496] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.473540] CR2: 00007f55d759d050 CR3: 0000000079ff4000 CR4: 00000000001406f0
[  114.475773] Call Trace:
[  114.477078]  oom_reaper+0xa2/0x1b0 /* oom_reap_task at mm/oom_kill.c:542 (inlined by) oom_reaper at mm/oom_kill.c:580 */
[  114.478569]  ? wake_up_bit+0x30/0x30
[  114.480058]  kthread+0x10d/0x140
[  114.481656]  ? __oom_reap_task_mm+0x160/0x160
[  114.483308]  ? kthread_create_on_node+0x60/0x60
[  114.485075]  ret_from_fork+0x27/0x40
[  114.486620] Code: c3 e8 54 82 f1 ff f0 80 8b 7a 04 00 00 40 48 8d bd 58 ff ff ff 48 83 c9 ff 31 d2 48 89 de e8 57 12 03 00 4c 8b 33 4d 85 f6 74 3b <49> 8b 46 50 a9 00 24 40 00 75 27 49 83 be 90 00 00 00 00 74 04 
[  114.491819] RIP: __oom_reap_task_mm+0xa1/0x160 RSP: ffff88007b6d3df0
[  114.494520] ---[ end trace e254efa6cf6f5fe6 ]---
----------

The __oom_reap_task_mm+0xa1/0x160 is __oom_reap_task_mm at mm/oom_kill.c:472
which is "struct vm_area_struct *vma;" line in __oom_reap_task_mm().
The __oom_reap_task_mm+0xb1/0x160 is __oom_reap_task_mm at mm/oom_kill.c:519
which is "if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))" line.
The <49> 8b 46 50 is "vma->vm_flags" in can_madv_dontneed_vma(vma) from __oom_reap_task_mm().

Is it safe for the OOM reaper to run the tlb_gather_mmu()/unmap_page_range()/
tlb_finish_mmu() sequence after the OOM victim has already completed the
tlb_gather_mmu()/unmap_vmas()/free_pgtables()/tlb_finish_mmu()/remove_vma()
sequence in exit_mmap() (called from __mmput() from mmput() from exit_mm()
from do_exit())? I guess we need to prevent the OOM reaper from running its
sequence once the OOM victim has already done so, and my patch, found via
trial and error, does exactly that.
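
(A sketch of the guard my patch effectively adds, assuming the exit path
marks the mm, e.g. with MMF_OOM_SKIP, under mmap_sem held for write once
exit_mmap() has torn everything down; names are illustrative, not the
exact diff:)

----------
static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
	if (!down_read_trylock(&mm->mmap_sem))
		return false;

	/*
	 * If exit_mmap() already won the race, the vma list is gone and
	 * walking it is exactly the use-after-free captured in the oops
	 * above (note the 0x6b poison pattern in RAX/R14).
	 */
	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
		up_read(&mm->mmap_sem);
		return true;
	}

	/* ... set_bit(MMF_UNSTABLE), tlb_gather_mmu(), unmap loop ... */
	up_read(&mm->mmap_sem);
	return true;
}
----------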

----------
unlock_oom:
        mutex_unlock(&oom_lock);
     26a:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
     271:       e8 00 00 00 00          callq  276 <__oom_reap_task_mm+0x56>
        return ret;
}
     276:       48 8b 55 d8             mov    -0x28(%rbp),%rdx
     27a:       65 48 33 14 25 28 00    xor    %gs:0x28,%rdx
     281:       00 00
     283:       89 d8                   mov    %ebx,%eax
     285:       75 10                   jne    297 <__oom_reap_task_mm+0x77>
     287:       48 81 c4 88 00 00 00    add    $0x88,%rsp
     28e:       5b                      pop    %rbx
     28f:       41 5c                   pop    %r12
     291:       41 5d                   pop    %r13
     293:       41 5e                   pop    %r14
     295:       5d                      pop    %rbp
     296:       c3                      retq
     297:       e8 00 00 00 00          callq  29c <__oom_reap_task_mm+0x7c>
 */
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
        if (IS_IMMEDIATE(nr)) {
                asm volatile(LOCK_PREFIX "orb %1,%0"
     29c:       f0 80 8b 7a 04 00 00    lock orb $0x40,0x47a(%rbx)
     2a3:       40
         * should imply barriers already and the reader would hit a page fault
         * if it stumbled over a reaped memory.
         */
        set_bit(MMF_UNSTABLE, &mm->flags);

        tlb_gather_mmu(&tlb, mm, 0, -1);
     2a4:       48 8d bd 58 ff ff ff    lea    -0xa8(%rbp),%rdi
     2ab:       48 83 c9 ff             or     $0xffffffffffffffff,%rcx
     2af:       31 d2                   xor    %edx,%edx
     2b1:       48 89 de                mov    %rbx,%rsi
     2b4:       e8 00 00 00 00          callq  2b9 <__oom_reap_task_mm+0x99>
        for (vma = mm->mmap ; vma; vma = vma->vm_next) {
     2b9:       4c 8b 33                mov    (%rbx),%r14
     2bc:       4d 85 f6                test   %r14,%r14
     2bf:       74 3b                   je     2fc <__oom_reap_task_mm+0xdc>
static DEFINE_SPINLOCK(oom_reaper_lock);

static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
        struct mmu_gather tlb;
        struct vm_area_struct *vma;
     2c1:       49 8b 46 50             mov    0x50(%r14),%rax
         */
        set_bit(MMF_UNSTABLE, &mm->flags);

        tlb_gather_mmu(&tlb, mm, 0, -1);
        for (vma = mm->mmap ; vma; vma = vma->vm_next) {
                if (!can_madv_dontneed_vma(vma))
     2c5:       a9 00 24 40 00          test   $0x402400,%eax
     2ca:       75 27                   jne    2f3 <__oom_reap_task_mm+0xd3>
                 * We do not even care about fs backed pages because all
                 * which are reclaimable have already been reclaimed and
                 * we do not want to block exit_mmap by keeping mm ref
                 * count elevated without a good reason.
                 */
                if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
     2cc:       49 83 be 90 00 00 00    cmpq   $0x0,0x90(%r14)
     2d3:       00
     2d4:       74 04                   je     2da <__oom_reap_task_mm+0xba>
     2d6:       a8 08                   test   $0x8,%al
     2d8:       75 19                   jne    2f3 <__oom_reap_task_mm+0xd3>
                        unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
     2da:       49 8b 4e 08             mov    0x8(%r14),%rcx
     2de:       49 8b 16                mov    (%r14),%rdx
     2e1:       48 8d bd 58 ff ff ff    lea    -0xa8(%rbp),%rdi
     2e8:       45 31 c0                xor    %r8d,%r8d
     2eb:       4c 89 f6                mov    %r14,%rsi
     2ee:       e8 00 00 00 00          callq  2f3 <__oom_reap_task_mm+0xd3>
         * if it stumbled over a reaped memory.
----------

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-17  5:17             ` Tetsuo Handa
@ 2017-06-20 22:12               ` David Rientjes
  -1 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2017-06-20 22:12 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, linux-kernel

On Sat, 17 Jun 2017, Tetsuo Handa wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 04c9143..cf1d331 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -470,38 +470,9 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
> -	bool ret = true;
> -
> -	/*
> -	 * We have to make sure to not race with the victim exit path
> -	 * and cause premature new oom victim selection:
> -	 * __oom_reap_task_mm		exit_mm
> -	 *   mmget_not_zero
> -	 *				  mmput
> -	 *				    atomic_dec_and_test
> -	 *				  exit_oom_victim
> -	 *				[...]
> -	 *				out_of_memory
> -	 *				  select_bad_process
> -	 *				    # no TIF_MEMDIE task selects new victim
> -	 *  unmap_page_range # frees some memory
> -	 */
> -	mutex_lock(&oom_lock);
> -
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> -		ret = false;
> -		goto unlock_oom;
> -	}
>  
> -	/*
> -	 * increase mm_users only after we know we will reap something so
> -	 * that the mmput_async is called only when we have reaped something
> -	 * and delayed __mmput doesn't matter that much
> -	 */
> -	if (!mmget_not_zero(mm)) {
> -		up_read(&mm->mmap_sem);
> -		goto unlock_oom;
> -	}
> +	if (!down_read_trylock(&mm->mmap_sem))
> +		return false;
>  
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -537,16 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
>  			K(get_mm_counter(mm, MM_SHMEMPAGES)));
>  	up_read(&mm->mmap_sem);
> -
> -	/*
> -	 * Drop our reference but make sure the mmput slow path is called from a
> -	 * different context because we shouldn't risk we get stuck there and
> -	 * put the oom_reaper out of the way.
> -	 */
> -	mmput_async(mm);
> -unlock_oom:
> -	mutex_unlock(&oom_lock);
> -	return ret;
> +	return true;
>  }
>  
>  #define MAX_OOM_REAP_RETRIES 10
> @@ -569,12 +531,31 @@ static void oom_reap_task(struct task_struct *tsk)
>  
>  done:
>  	tsk->oom_reaper_list = NULL;
> +	/*
> +	 * Drop a mm_users reference taken by mark_oom_victim().
> +	 * A mm_count reference taken by mark_oom_victim() remains.
> +	 */
> +	mmput_async(mm);

This doesn't prevent serial oom killing for either the system oom killer
or the memcg oom killer.

The oom killer cannot detect tsk_is_oom_victim() if the task has either
been removed from the tasklist or has already done cgroup_exit().  For
memcg oom killings in particular, cgroup_exit() is usually called very
shortly after the oom killer has sent the SIGKILL.  If the oom reaper has
not finished (for example because it failed to grab mm->mmap_sem) before
another memcg charge occurs after cgroup_exit(victim), additional processes
are killed because the iteration does not see the victim.

This easily kills all processes attached to the memcg with no memory 
freeing from any victim.
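
(To make the failure mode concrete: victim detection lives in the task
iteration, roughly the 4.12-era oom_evaluate_task() below, quoted from
memory. Once cgroup_exit(victim) runs, mem_cgroup_scan_tasks() never
visits the victim, so the abort never fires:)

----------
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;

	/* ... */

	/*
	 * This check is the only thing that stops a second kill, and it
	 * can only fire if the victim is still visible to the tasklist
	 * or css_task_iter walk.
	 */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}

	/* ... scoring ... */
}
----------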

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-20 22:12               ` David Rientjes
  (?)
@ 2017-06-21  2:17               ` Tetsuo Handa
  2017-06-21 20:31                   ` David Rientjes
  -1 siblings, 1 reply; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-21  2:17 UTC (permalink / raw)
  To: David Rientjes; +Cc: mhocko, akpm, linux-mm, linux-kernel

David Rientjes wrote:
> This doesn't prevent serial oom killing for either the system oom killer 
> or for the memcg oom killer.
> 
> The oom killer cannot detect tsk_is_oom_victim() if the task has either 
> been removed from the tasklist or has already done cgroup_exit().  For 
> memcg oom killings in particular, cgroup_exit() is usually called very 
> shortly after the oom killer has sent the SIGKILL.  If the oom reaper does 
> not fail (for example by failing to grab mm->mmap_sem) before another 
> memcg charge after cgroup_exit(victim), additional processes are killed 
> because the iteration does not view the victim.
> 
> This easily kills all processes attached to the memcg with no memory 
> freeing from any victim.

Umm... So, you are pointing out that having select_bad_process() abort based
on TIF_MEMDIE or MMF_OOM_SKIP is broken because victim threads can be removed
from the global task list or a cgroup's task list. Then, the OOM killer will
have to wait until all mm_structs in the affected OOM domain (system-wide or
some cgroup) are reaped by the OOM reaper. The simplest way is to wait until
all mm_structs are reaped by the OOM reaper, for currently we are not
tracking which memory cgroup each mm_struct belongs to, are we? But that can
cause needless delay when multiple OOM events occur in different OOM domains.
Do we want to (and can we) make it possible to tell whether each mm_struct
queued on the OOM reaper's list belongs to the thread calling out_of_memory()?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-20 22:12               ` David Rientjes
@ 2017-06-21 13:18                 ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-21 13:18 UTC (permalink / raw)
  To: David Rientjes; +Cc: Tetsuo Handa, akpm, linux-mm, linux-kernel

On Tue 20-06-17 15:12:55, David Rientjes wrote:
[...]
> This doesn't prevent serial oom killing for either the system oom killer 
> or for the memcg oom killer.
> 
> The oom killer cannot detect tsk_is_oom_victim() if the task has either
> been removed from the tasklist or has already done cgroup_exit().  For
> memcg oom killings in particular, cgroup_exit() is usually called very
> shortly after the oom killer has sent the SIGKILL.  If the oom reaper
> does not fail (for example, by failing to grab mm->mmap_sem) before
> another memcg charge occurs after cgroup_exit(victim), additional
> processes are killed because the task iteration no longer sees the
> victim.
> 
> This can easily kill all processes attached to the memcg without any
> memory having been freed by any victim.

It took me some time to decrypt the above, but you are right. Pinning
mm_users will prevent the exit path from reaching exit_mmap(), and that
can indeed cause another premature oom kill because the task might be
unhashed or removed from the memcg before the oom reaper has a chance
to reap it. Thanks for pointing this out. This means that we either
have to reimplement the unhashing/cgroup_exit for oom victims or go
back to allowing the oom reaper to race with exit_mmap(). The latter
sounds much easier to me.
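
For reference, roughly what mmput() looks like in this era (a
simplified sketch of kernel/fork.c, not verbatim): only the final
mm_users reference runs __mmput() and with it exit_mmap(), which is
exactly why an extra reference pins the mapped memory:

void mmput(struct mm_struct *mm)
{
	might_sleep();

	/*
	 * With an extra mm_users reference taken in mark_oom_victim(),
	 * the victim's own exit_mm()->mmput() no longer hits zero, so
	 * exit_mmap() is deferred until the reaper's mmput_async().
	 */
	if (atomic_dec_and_test(&mm->mm_users))
		__mmput(mm);	/* exit_mmap() happens in here */
}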

I was offline for the last two days, but I will revisit my original
idea ASAP.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-21  2:17               ` Tetsuo Handa
@ 2017-06-21 20:31                   ` David Rientjes
  0 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2017-06-21 20:31 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, linux-mm, linux-kernel

On Wed, 21 Jun 2017, Tetsuo Handa wrote:

> Umm... So you are pointing out that having select_bad_process() abort
> based on TIF_MEMDIE or MMF_OOM_SKIP is broken because victim threads can
> be removed from the global task list or from a cgroup's task list. Then
> the OOM killer will have to wait until every mm_struct in the affected
> OOM domain (system-wide or some cgroup) has been reaped by the OOM
> reaper. The simplest way is to wait until all mm_structs are reaped by
> the OOM reaper, since we are currently not tracking which memory cgroup
> each mm_struct belongs to, are we? But that can cause needless delay
> when multiple OOM events occur in different OOM domains. Do we want to
> (and can we) make it possible to tell whether each mm_struct queued to
> the OOM reaper's list belongs to the thread calling out_of_memory()?
> 

I am saying that taking mmget() in mark_oom_victim() and then only
dropping it with mmput_async() once the oom reaper can grab
mm->mmap_sem, which the exit path itself takes, or once the oom reaper
happens to schedule, causes __mmput() to be called much later.  Thus,
with your patch, the process is removed from the tasklist, or
cgroup_exit() is called, before the memory can be unmapped.  As a
result, subsequent calls to the oom killer kill everything before the
original victim's mm can undergo __mmput(), because the oom reaper
still holds the reference.
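
A tiny userspace model of that ordering (all names illustrative; this
is not kernel code) shows how the extra reference pushes __mmput() past
the point where the victim becomes invisible to the task iteration:

#include <stdio.h>

/* Toy stand-in for mm_struct's mm_users refcount. */
struct toy_mm { int mm_users; };

static void toy_mmput(struct toy_mm *mm, const char *who)
{
	if (--mm->mm_users == 0)
		printf("%s: final put, __mmput()/exit_mmap() frees memory\n",
		       who);
	else
		printf("%s: mm_users now %d, memory stays mapped\n",
		       who, mm->mm_users);
}

int main(void)
{
	struct toy_mm mm = { .mm_users = 1 };

	mm.mm_users++;			/* mark_oom_victim(): mmget() */
	toy_mmput(&mm, "exit_mm");	/* victim exits: 2 -> 1, no free */
	/* cgroup_exit()/unhash_process() happen here: the victim is no
	 * longer visible to the OOM killer, but its memory is mapped. */
	toy_mmput(&mm, "oom_reaper");	/* mmput_async(): 1 -> 0, late */
	return 0;
}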

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-21 20:31                   ` David Rientjes
  (?)
@ 2017-06-22  0:53                   ` Tetsuo Handa
  2017-06-23 12:45                       ` Michal Hocko
  -1 siblings, 1 reply; 70+ messages in thread
From: Tetsuo Handa @ 2017-06-22  0:53 UTC (permalink / raw)
  To: David Rientjes; +Cc: mhocko, akpm, linux-mm, linux-kernel

David Rientjes wrote:
> On Wed, 21 Jun 2017, Tetsuo Handa wrote:
> > Umm... So you are pointing out that having select_bad_process() abort
> > based on TIF_MEMDIE or MMF_OOM_SKIP is broken because victim threads can
> > be removed from the global task list or from a cgroup's task list. Then
> > the OOM killer will have to wait until every mm_struct in the affected
> > OOM domain (system-wide or some cgroup) has been reaped by the OOM
> > reaper. The simplest way is to wait until all mm_structs are reaped by
> > the OOM reaper, since we are currently not tracking which memory cgroup
> > each mm_struct belongs to, are we? But that can cause needless delay
> > when multiple OOM events occur in different OOM domains. Do we want to
> > (and can we) make it possible to tell whether each mm_struct queued to
> > the OOM reaper's list belongs to the thread calling out_of_memory()?
> > 
> 
> I am saying that taking mmget() in mark_oom_victim() and then only
> dropping it with mmput_async() once the oom reaper can grab
> mm->mmap_sem, which the exit path itself takes, or once the oom reaper
> happens to schedule, causes __mmput() to be called much later.  Thus,
> with your patch, the process is removed from the tasklist, or
> cgroup_exit() is called, before the memory can be unmapped.  As a
> result, subsequent calls to the oom killer kill everything before the
> original victim's mm can undergo __mmput(), because the oom reaper
> still holds the reference.

Here is "wait for all mm_struct are reaped by the OOM reaper" version.

 include/linux/sched.h |   3 -
 mm/oom_kill.c         | 150 ++++++++++++++++++++++++--------------------------
 2 files changed, 71 insertions(+), 82 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc6..0d9904e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1034,9 +1034,6 @@ struct task_struct {
 	unsigned long			task_state_change;
 #endif
 	int				pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
-#endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..fb0b8dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -296,6 +296,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		goto next;
 
+#ifndef CONFIG_MMU
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -307,6 +308,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 			goto next;
 		goto abort;
 	}
+#endif
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -332,11 +334,13 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	oc->chosen_points = points;
 next:
 	return 0;
+#ifndef CONFIG_MMU
 abort:
 	if (oc->chosen)
 		put_task_struct(oc->chosen);
 	oc->chosen = (void *)-1UL;
 	return 1;
+#endif
 }
 
 /*
@@ -463,45 +467,17 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
+static struct mm_struct *oom_mm;
+static char oom_mm_owner_comm[TASK_COMM_LEN];
+static pid_t oom_mm_owner_pid;
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static bool __oom_reap_mm(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -532,89 +508,71 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
+			oom_mm_owner_pid, oom_mm_owner_comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_mm(struct mm_struct *mm)
 {
 	int attempts = 0;
-	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_mm(mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
+		oom_mm_owner_pid, oom_mm_owner_comm);
 	debug_show_all_locks();
 
 done:
-	tsk->oom_reaper_list = NULL;
-
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+	/*
+	 * Drop a mm_users reference taken by mark_oom_victim().
+	 * A mm_count reference taken by mark_oom_victim() remains.
+	 */
+	mmput_async(mm);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
-
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		wait_event(oom_reaper_wait, oom_mm);
+		oom_reap_mm(oom_mm);
+		mutex_lock(&oom_lock);
+		oom_mm = NULL;
+		mutex_unlock(&oom_lock);
 	}
-
 	return 0;
 }
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
-		return;
-
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Since only tsk == current case can reach here when oom_mm != NULL,
+	 * the OOM reaper will reap current->mm on behalf of current thread if
+	 * oom_mm != NULL. Thus, just drop a mm_users reference taken by
+	 * mark_oom_victim().
+	 */
+	if (!oom_reaper_th || oom_mm) {
+		mmput_async(tsk->signal->oom_mm);
 		return;
-
-	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	}
+	strlcpy(oom_mm_owner_comm, tsk->comm, sizeof(oom_mm_owner_comm));
+	oom_mm_owner_pid = task_pid_nr(tsk);
+	oom_mm = tsk->signal->oom_mm;
 	wake_up(&oom_reaper_wait);
 }
 
@@ -650,12 +608,32 @@ static void mark_oom_victim(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->mm;
 
 	WARN_ON(oom_killer_disabled);
+#ifdef CONFIG_MMU
+	/*
+	 * Take a mm_users reference so that __oom_reap_mm() can unmap
+	 * pages without risking a race condition where final mmput() from
+	 * exit_mm() from do_exit() triggered __mmput() and gets stuck there
+	 * (but __oom_reap_mm() cannot unmap pages due to mm_users == 0).
+	 *
+	 * Since all callers guarantee that this mm is stable (hold task_lock
+	 * or tsk == current), we can safely use mmget() here.
+	 *
+	 * When dropping this reference, mmput_async() has to be used because
+	 * __mmput() can get stuck which in turn keeps the OOM killer/reaper
+	 * disabled forever.
+	 */
+	mmget(mm);
+#endif
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		/*
+		 * Take a mm_count reference so that we can examine flags value
+		 * when tsk_is_oom_victim() is true.
+		 */
 		mmgrab(tsk->signal->oom_mm);
 
 	/*
@@ -908,6 +886,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
+#ifdef CONFIG_MMU
+			mmput_async(mm);
+#endif
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
@@ -1005,6 +986,17 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+#ifdef CONFIG_MMU
+	/*
+	 * Wait for the OOM reaper to reap existing OOM victim's mm in order
+	 * to avoid selecting next OOM victims prematurely. This will block
+	 * OOM events in different domains and SysRq-f, but this should be no
+	 * problem because the OOM reaper is guaranteed not to wait forever.
+	 */
+	if (oom_mm)
+		return true;
+#endif
+
 	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed
  2017-06-17 13:30                         ` Tetsuo Handa
@ 2017-06-23 12:38                           ` Michal Hocko
  -1 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-23 12:38 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, linux-mm, linux-kernel

On Sat 17-06-17 22:30:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > What does this disassemble to on your kernel? Care to post addr2line?
> 
[...]
> The __oom_reap_task_mm+0xa1/0x160 is __oom_reap_task_mm at mm/oom_kill.c:472
> which is "struct vm_area_struct *vma;" line in __oom_reap_task_mm().
> The __oom_reap_task_mm+0xb1/0x160 is __oom_reap_task_mm at mm/oom_kill.c:519
> which is "if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))" line.
> The <49> 8b 46 50 is "vma->vm_flags" in can_madv_dontneed_vma(vma) from __oom_reap_task_mm().

OK, I see what is going on here. I could have noticed earlier; sorry,
my fault. We are simply accessing a stale mm->mmap: exit_mmap() does
remove_vma(), which frees all the vmas, but it doesn't reset mm->mmap
to NULL. Trivial to fix.

diff --git a/mm/mmap.c b/mm/mmap.c
index ca58f8a2a217..253808e716dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2979,6 +2979,7 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
 	up_write(&mm->mmap_sem);
 }

> Is it safe for the OOM reaper to call the tlb_gather_mmu()/unmap_page_range()/tlb_finish_mmu() sequence
> after the OOM victim has already completed the tlb_gather_mmu()/unmap_vmas()/free_pgtables()/tlb_finish_mmu()/
> remove_vma() sequence from exit_mmap() from __mmput() from mmput() from exit_mm() from do_exit()?

It is safe to race up until unmap_vmas() because that only needs
mmap_sem in read mode (e.g. madvise(MADV_DONTNEED) takes it the same
way), while all the later operations have to be serialized because we
cannot tear down page tables while the oom reaper is doing a pte walk.
After we drop mmap_sem for write in exit_mmap() there are no vmas left,
so there is nothing for the reaper to do.
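
A sketch of the resulting split (an assumed shape for discussion, not
the final patch): unmapping may race with the reaper, page-table
teardown may not, and the mm->mmap = NULL store above marks the mm as
empty for the reaper:

void exit_mmap_sketch(struct mm_struct *mm)
{
	/* Phase 1: unmap_vmas().  Safe to race with the oom reaper,
	 * which zaps pages under down_read_trylock(&mm->mmap_sem) just
	 * like madvise(MADV_DONTNEED) from another thread would. */

	down_write(&mm->mmap_sem);
	/* Phase 2: free_pgtables() + remove_vma().  Must be exclusive;
	 * the reaper cannot walk ptes while tables are torn down. */
	mm->mmap = NULL;	/* reaper now finds no vmas to reap */
	up_write(&mm->mmap_sem);
}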

I will give the patch more testing next week. This week was busy as
hell (I was travelling, and then the stack gap thingy...).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.
  2017-06-22  0:53                   ` Tetsuo Handa
@ 2017-06-23 12:45                       ` Michal Hocko
  0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2017-06-23 12:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: David Rientjes, akpm, linux-mm, linux-kernel

On Thu 22-06-17 09:53:48, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Wed, 21 Jun 2017, Tetsuo Handa wrote:
> > > Umm... So you are pointing out that having select_bad_process() abort
> > > based on TIF_MEMDIE or MMF_OOM_SKIP is broken because victim threads can
> > > be removed from the global task list or from a cgroup's task list. Then
> > > the OOM killer will have to wait until every mm_struct in the affected
> > > OOM domain (system-wide or some cgroup) has been reaped by the OOM
> > > reaper. The simplest way is to wait until all mm_structs are reaped by
> > > the OOM reaper, since we are currently not tracking which memory cgroup
> > > each mm_struct belongs to, are we? But that can cause needless delay
> > > when multiple OOM events occur in different OOM domains. Do we want to
> > > (and can we) make it possible to tell whether each mm_struct queued to
> > > the OOM reaper's list belongs to the thread calling out_of_memory()?
> > > 
> > 
> > I am saying that taking mmget() in mark_oom_victim() and then only
> > dropping it with mmput_async() once the oom reaper can grab
> > mm->mmap_sem, which the exit path itself takes, or once the oom reaper
> > happens to schedule, causes __mmput() to be called much later.  Thus,
> > with your patch, the process is removed from the tasklist, or
> > cgroup_exit() is called, before the memory can be unmapped.  As a
> > result, subsequent calls to the oom killer kill everything before the
> > original victim's mm can undergo __mmput(), because the oom reaper
> > still holds the reference.
> 
> Here is "wait for all mm_struct are reaped by the OOM reaper" version.

Well, this is getting more and more hairy. I think we should explore
the possibility of the oom_reaper and exit_mmap() working together
after all.

Yes, I've said that a solution fully within the oom proper would be
preferable, but this just grows into a complex, hairy mess. Maybe we
will find out that oom_reaper vs. exit_mmap is just not feasible and
will reconsider this approach in the end, but let's try a clean
solution first. As I've said, there is nothing fundamentally hard about
parallel unmapping; MADV_DONTNEED does that already. We just have to
iron out those tiny details.
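
As a concrete reference for "MADV_DONTNEED does that already": the
small userspace program below (illustrative only) drops the pages
behind a live mapping in its own address space, which the kernel
performs with mmap_sem held for read -- the same mode the oom reaper
uses:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0xaa, len);

	/* Discard the backing pages; the mapping itself stays valid. */
	if (madvise(p, len, MADV_DONTNEED))
		return 1;

	/* Private anonymous memory reads back as fresh zero pages. */
	printf("after MADV_DONTNEED: first byte = %#x\n", p[0]);
	return munmap(p, len);
}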
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 70+ messages in thread
