All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
@ 2017-12-07 11:42 Tetsuo Handa
  2017-12-07 11:51 ` Michal Hocko
  2017-12-11 11:57 ` Michal Hocko
  0 siblings, 2 replies; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-07 11:42 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, Andrea Arcangeli, David Rientjes,
	Johannes Weiner, Manish Jaggi, Michal Hocko, Oleg Nesterov,
	Vladimir Davydov

Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer [1].
Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
victim's mm were not able to try allocation from memory reserves after the
OOM reaper gave up reclaiming memory.

Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
last second allocation attempt.

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73f5d45..5d054a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3309,6 +3309,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	return page;
 }
 
+static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
+					       unsigned int order,
+					       const struct alloc_context *ac);
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
@@ -3334,16 +3338,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		return NULL;
 	}
 
-	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
-	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
-				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
 	if (page)
 		goto out;
 
@@ -3755,6 +3750,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
+static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
+					       unsigned int order,
+					       const struct alloc_context *ac)
+{
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure. But make sure that this reclaim
+	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
+	 * allocation which will never fail due to oom_lock already held.
+	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
+	 * in case they haven't tried ALLOC_OOM watermark.
+	 */
+	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
+	int reserve_flags;
+
+	gfp_mask |= __GFP_HARDWALL;
+	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
+	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+}
+
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-07 11:42 [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
@ 2017-12-07 11:51 ` Michal Hocko
  2017-12-07 11:59   ` Tetsuo Handa
  2017-12-11 11:57 ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-07 11:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, Andrea Arcangeli, David Rientjes,
	Johannes Weiner, Manish Jaggi, Oleg Nesterov, Vladimir Davydov

On Thu 07-12-17 20:42:20, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
> 
> Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> last second allocation attempt.
> 
> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> 
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

I haven't acked _this_ patch! I will have a look but the patch is
different enough from the original that keeping any acks or reviews is
inappropriate. Do not do it again!

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 73f5d45..5d054a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3309,6 +3309,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	return page;
>  }
>  
> +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> +					       unsigned int order,
> +					       const struct alloc_context *ac);
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
> @@ -3334,16 +3338,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  		return NULL;
>  	}
>  
> -	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure. But make sure that this reclaim
> -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> -	 * allocation which will never fail due to oom_lock already held.
> -	 */
> -	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> -				      ~__GFP_DIRECT_RECLAIM, order,
> -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
>  	if (page)
>  		goto out;
>  
> @@ -3755,6 +3750,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return !!__gfp_pfmemalloc_flags(gfp_mask);
>  }
>  
> +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> +					       unsigned int order,
> +					       const struct alloc_context *ac)
> +{
> +	/*
> +	 * Go through the zonelist yet one more time, keep very high watermark
> +	 * here, this is only to catch a parallel oom killing, we must fail if
> +	 * we're still under heavy pressure. But make sure that this reclaim
> +	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> +	 * allocation which will never fail due to oom_lock already held.
> +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
> +	 * in case they haven't tried ALLOC_OOM watermark.
> +	 */
> +	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> +	int reserve_flags;
> +
> +	gfp_mask |= __GFP_HARDWALL;
> +	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserve_flags)
> +		alloc_flags = reserve_flags;
> +	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +}
> +
>  /*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-07 11:51 ` Michal Hocko
@ 2017-12-07 11:59   ` Tetsuo Handa
  2017-12-07 12:22     ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-07 11:59 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

Michal Hocko wrote:
> On Thu 07-12-17 20:42:20, Tetsuo Handa wrote:
> > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> > count causes random kernel panics when an OOM victim which consumed memory
> > in a way the OOM reaper does not help was selected by the OOM killer [1].
> > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > victim's mm were not able to try allocation from memory reserves after the
> > OOM reaper gave up reclaiming memory.
> > 
> > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> > last second allocation attempt.
> > 
> > [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> > 
> > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> I haven't acked _this_ patch! I will have a look but the patch is
> different enough from the original that keeping any acks or reviews is
> inappropriate. Do not do it again!

I see. But nothing has changed except that this is called before entering
into the OOM killer. I assumed that this is a trivial change.

> 
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/page_alloc.c | 39 +++++++++++++++++++++++++++++----------
> >  1 file changed, 29 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 73f5d45..5d054a4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3309,6 +3309,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> >  	return page;
> >  }
> >  
> > +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> > +					       unsigned int order,
> > +					       const struct alloc_context *ac);
> > +
> >  static inline struct page *
> >  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  	const struct alloc_context *ac, unsigned long *did_some_progress)
> > @@ -3334,16 +3338,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> >  		return NULL;
> >  	}
> >  
> > -	/*
> > -	 * Go through the zonelist yet one more time, keep very high watermark
> > -	 * here, this is only to catch a parallel oom killing, we must fail if
> > -	 * we're still under heavy pressure. But make sure that this reclaim
> > -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > -	 * allocation which will never fail due to oom_lock already held.
> > -	 */
> > -	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> > -				      ~__GFP_DIRECT_RECLAIM, order,
> > -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> > +	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
> >  	if (page)
> >  		goto out;
> >  
> > @@ -3755,6 +3750,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  	return !!__gfp_pfmemalloc_flags(gfp_mask);
> >  }
> >  
> > +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> > +					       unsigned int order,
> > +					       const struct alloc_context *ac)
> > +{
> > +	/*
> > +	 * Go through the zonelist yet one more time, keep very high watermark
> > +	 * here, this is only to catch a parallel oom killing, we must fail if
> > +	 * we're still under heavy pressure. But make sure that this reclaim
> > +	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > +	 * allocation which will never fail due to oom_lock already held.
> > +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
> > +	 * in case they haven't tried ALLOC_OOM watermark.
> > +	 */
> > +	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> > +	int reserve_flags;
> > +
> > +	gfp_mask |= __GFP_HARDWALL;
> > +	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> > +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> > +	if (reserve_flags)
> > +		alloc_flags = reserve_flags;
> > +	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > +}
> > +
> >  /*
> >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> >   * for the given allocation request.
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-07 11:59   ` Tetsuo Handa
@ 2017-12-07 12:22     ` Michal Hocko
  2017-12-08 10:58       ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-07 12:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

On Thu 07-12-17 20:59:34, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 07-12-17 20:42:20, Tetsuo Handa wrote:
> > > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> > > count causes random kernel panics when an OOM victim which consumed memory
> > > in a way the OOM reaper does not help was selected by the OOM killer [1].
> > > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > > victim's mm were not able to try allocation from memory reserves after the
> > > OOM reaper gave up reclaiming memory.
> > > 
> > > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> > > last second allocation attempt.
> > > 
> > > [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> > > 
> > > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > I haven't acked _this_ patch! I will have a look but the patch is
> > different enough from the original that keeping any acks or reviews is
> > inappropriate. Do not do it again!
> 
> I see. But nothing has changed except that this is called before entering
> into the OOM killer. I assumed that this is a trivial change.

Let the reviewers judge and have them add their acks/reviewed-bys again.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-07 12:22     ` Michal Hocko
@ 2017-12-08 10:58       ` Tetsuo Handa
  2017-12-11 11:15         ` [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP unless __GFP_NOFAIL Tetsuo Handa
  2017-12-11 11:42         ` [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Michal Hocko
  0 siblings, 2 replies; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-08 10:58 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, linux-mm, aarcange, rientjes, mjaggi, oleg, vdavydov.dev

Michal Hocko wrote:
> On Thu 07-12-17 20:59:34, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 07-12-17 20:42:20, Tetsuo Handa wrote:
> > > > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> > > > count causes random kernel panics when an OOM victim which consumed memory
> > > > in a way the OOM reaper does not help was selected by the OOM killer [1].
> > > > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > > > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > > > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > > > victim's mm were not able to try allocation from memory reserves after the
> > > > OOM reaper gave up reclaiming memory.
> > > > 
> > > > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> > > > last second allocation attempt.
> > > > 
> > > > [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> > > > 
> > > > Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> > > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > I haven't acked _this_ patch! I will have a look but the patch is
> > > different enough from the original that keeping any acks or reviews is
> > > inappropriate. Do not do it again!
> > 
> > I see. But nothing has changed except that this is called before entering
> > into the OOM killer. I assumed that this is a trivial change.
> 
> Let the reviewers judge and have them add their acks/reviewed-bys again.

OK. Here I prepare food for review. Johannes, please be sure to respond.

When Manish reported this problem, I tried to manage this problem by
"mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
but it was rejected by Michal
( http://lkml.kernel.org/r/20170821084307.GB25956@dhcp22.suse.cz ).
Michal told me

  Sigh... Let me repeat for the last time (this whole thread is largely a
  waste of time to be honest). Find a _robust_ solution rather than
  fiddling with try-once-more kind of hacks. E.g. do an allocation attempt
  _before_ we do any disruptive action (aka kill a victim). This would
  help other cases when we race with an exiting tasks or somebody managed
  to free memory while we were selecting an oom victim which can take
  quite some time.

and I wrote "mm,oom: move last second allocation to inside the OOM killer" (with
an acceptance to apply "mm,oom: use ALLOC_OOM for OOM victim's last second
allocation" on top of it) but it was rejected by Johannes.

Therefore, I'm stuck between Michal and Johannes. And I updated "mm,oom: use
ALLOC_OOM for OOM victim's last second allocation" not to depend on "mm,oom:
move last second allocation to inside the OOM killer".

The original "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
is the simplest change and easy to handle all corner cases. But if we go "mm,oom: use
ALLOC_OOM for OOM victim's last second allocation" approach, we also need to apply
"mm,oom: Set ->signal->oom_mm to all thread groups sharing the victim's mm."
( http://lkml.kernel.org/r/1511872888-4579-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )
in order to handle corner cases, and it was rejected by Michal. Then, if we won't
handle corner cases, we need to update

	/*
	 * Kill all user processes sharing victim->mm in other thread groups, if
	 * any.  They don't get access to memory reserves, though, to avoid
	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
	 * oom killed thread cannot exit because it requires the semaphore and
	 * its contended by another thread trying to allocate memory itself.
	 * That thread will now get access to memory reserves since it has a
	 * pending fatal signal.
	 */

comment which is no longer true.

Michal and Johannes, please discuss between you and show me the direction to go.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP unless __GFP_NOFAIL.
  2017-12-08 10:58       ` Tetsuo Handa
@ 2017-12-11 11:15         ` Tetsuo Handa
  2017-12-11 11:44           ` Michal Hocko
  2017-12-11 11:42         ` [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-11 11:15 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: akpm, linux-mm, aarcange, rientjes, mjaggi, oleg, vdavydov.dev,
	penguin-kernel

>From 6f45864753ce820adede5b318b9cb341ffd3e740 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 11 Dec 2017 19:52:07 +0900
Subject: [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP
 unless __GFP_NOFAIL.

Manish Jaggi noticed that running LTP oom01/oom02 tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer [1].

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
victim's mm were not able to try allocation from memory reserves after the
OOM reaper gave up reclaiming memory.

Since __alloc_pages_slowpath() will bail out after ALLOC_OOM allocation
failed (unless __GFP_NOFAIL is specified), this patch forces OOM victims
to try ALLOC_OOM allocation and then bail out rather than selecting next
OOM victim (unless __GFP_NOFAIL is specified which is necessary for
avoiding potential OOM lockup).

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7f54d9f..f71fe4c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -784,7 +784,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
  * Caller has to make sure that task->mm is stable (hold task_lock or
  * it operates on the current).
  */
-static bool task_will_free_mem(struct task_struct *task)
+static bool task_will_free_mem(struct task_struct *task, gfp_t gfp_mask)
 {
 	struct mm_struct *mm = task->mm;
 	struct task_struct *p;
@@ -802,10 +802,10 @@ static bool task_will_free_mem(struct task_struct *task)
 		return false;
 
 	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
+	 * Select next OOM victim only if existing OOM victims can not satisfy
+	 * __GFP_NOFAIL allocation even after the OOM reaper reclaimed memory.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+	if ((gfp_mask & __GFP_NOFAIL) && test_bit(MMF_OOM_SKIP, &mm->flags))
 		return false;
 
 	if (atomic_read(&mm->mm_users) <= 1)
@@ -938,7 +938,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 * so it can die quickly
 	 */
 	task_lock(p);
-	if (task_will_free_mem(p)) {
+	if (task_will_free_mem(p, oc->gfp_mask)) {
 		mark_oom_victim(p);
 		wake_oom_reaper(p);
 		task_unlock(p);
@@ -1092,7 +1092,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (task_will_free_mem(current)) {
+	if (task_will_free_mem(current, oc->gfp_mask)) {
 		mark_oom_victim(current);
 		wake_oom_reaper(current);
 		return true;
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-08 10:58       ` Tetsuo Handa
  2017-12-11 11:15         ` [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP unless __GFP_NOFAIL Tetsuo Handa
@ 2017-12-11 11:42         ` Michal Hocko
  2017-12-12  8:09           ` Tetsuo Handa
  1 sibling, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-11 11:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, akpm, linux-mm, aarcange, rientjes, mjaggi, oleg, vdavydov.dev

On Fri 08-12-17 19:58:11, Tetsuo Handa wrote:
[...]
> Therefore, I'm stuck between Michal and Johannes. And I updated "mm,oom: use
> ALLOC_OOM for OOM victim's last second allocation" not to depend on "mm,oom:
> move last second allocation to inside the OOM killer".

No, you seem to be stuck elsewhere. You keep ignoring that the OOM
killer is a best effort heuristic to free up some memory. And as any
other heuristic it has to balance cons and pros. One of the biggest
argument in those decision is how _serious_ the problem is another tweak
worth all the possible downfalls?

You keep repeating Manish report which is an _artificial_ OOM scenario.
It is true that we can do better in that case but the real solution
looks differently - we should make mlocked memory reapable. Now that is
not a trivial thing to do and I still have that on my todo list. You are
actively avoiding the real solution by providing tweaks (try one more
time) here and there. I really hate that approach. This will make the
behavior time dependant as Johannes pointed out.

I was OK with your "move the last allocation attempt" because it
conceptually makes some sense at least. Johannes had arguments against
and I do respect them because I do agree it is not a general and
_measurable_ win. And this is how the consensus based develoment works.
We are not pushing for questionable solutions unless there is an
absolute urge for that because the issue is serious and many users
suffer from it yet there is no real solution in sight. See the
difference?

That being said, I will keep refusing other such tweaks unless you have
a sound usecase behind. If you really _want_ to help out here then you
can focus on the reaping of the mlock memory.

Thanks
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP unless __GFP_NOFAIL.
  2017-12-11 11:15         ` [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP unless __GFP_NOFAIL Tetsuo Handa
@ 2017-12-11 11:44           ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-12-11 11:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, akpm, linux-mm, aarcange, rientjes, mjaggi, oleg, vdavydov.dev

On Mon 11-12-17 20:15:35, Tetsuo Handa wrote:
> >From 6f45864753ce820adede5b318b9cb341ffd3e740 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 11 Dec 2017 19:52:07 +0900
> Subject: [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP
>  unless __GFP_NOFAIL.
> 
> Manish Jaggi noticed that running LTP oom01/oom02 tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].

How does this have anything to do with any GFP_NOFAIL allocations?

> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
> 
> Since __alloc_pages_slowpath() will bail out after ALLOC_OOM allocation
> failed (unless __GFP_NOFAIL is specified), this patch forces OOM victims
> to try ALLOC_OOM allocation and then bail out rather than selecting next
> OOM victim (unless __GFP_NOFAIL is specified which is necessary for
> avoiding potential OOM lockup).

Nack again. The changelog doesn't make any sense at all. And I am really
not convinced this actually fixes any real problem. Please do not bother
even sending patches like that.

> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> 
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/oom_kill.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7f54d9f..f71fe4c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -784,7 +784,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
>   * Caller has to make sure that task->mm is stable (hold task_lock or
>   * it operates on the current).
>   */
> -static bool task_will_free_mem(struct task_struct *task)
> +static bool task_will_free_mem(struct task_struct *task, gfp_t gfp_mask)
>  {
>  	struct mm_struct *mm = task->mm;
>  	struct task_struct *p;
> @@ -802,10 +802,10 @@ static bool task_will_free_mem(struct task_struct *task)
>  		return false;
>  
>  	/*
> -	 * This task has already been drained by the oom reaper so there are
> -	 * only small chances it will free some more
> +	 * Select next OOM victim only if existing OOM victims can not satisfy
> +	 * __GFP_NOFAIL allocation even after the OOM reaper reclaimed memory.
>  	 */
> -	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +	if ((gfp_mask & __GFP_NOFAIL) && test_bit(MMF_OOM_SKIP, &mm->flags))
>  		return false;
>  
>  	if (atomic_read(&mm->mm_users) <= 1)
> @@ -938,7 +938,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	 * so it can die quickly
>  	 */
>  	task_lock(p);
> -	if (task_will_free_mem(p)) {
> +	if (task_will_free_mem(p, oc->gfp_mask)) {
>  		mark_oom_victim(p);
>  		wake_oom_reaper(p);
>  		task_unlock(p);
> @@ -1092,7 +1092,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
>  	 */
> -	if (task_will_free_mem(current)) {
> +	if (task_will_free_mem(current, oc->gfp_mask)) {
>  		mark_oom_victim(current);
>  		wake_oom_reaper(current);
>  		return true;
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-07 11:42 [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
  2017-12-07 11:51 ` Michal Hocko
@ 2017-12-11 11:57 ` Michal Hocko
  2017-12-13 11:06   ` Tetsuo Handa
  1 sibling, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-11 11:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, Andrea Arcangeli, David Rientjes,
	Johannes Weiner, Manish Jaggi, Oleg Nesterov, Vladimir Davydov

[I didn't get to this patch any sooner, sorry about that]

On Thu 07-12-17 20:42:20, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
> 
> Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> last second allocation attempt.

This changelog doesn't explain the problem, nor does it say why it
should help. I would even argue that mentioning the LTP test is more
confusing than helpful (also considering it a fix for 696453e66630ad45)
because granting access to memory reserves will only help partially.
Anyway, the patch makes some sense to me but I am not going to ack it
with a misleading changelog.

> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
> 
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 73f5d45..5d054a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3309,6 +3309,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	return page;
>  }
>  
> +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> +					       unsigned int order,
> +					       const struct alloc_context *ac);
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
> @@ -3334,16 +3338,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  		return NULL;
>  	}
>  
> -	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure. But make sure that this reclaim
> -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> -	 * allocation which will never fail due to oom_lock already held.
> -	 */
> -	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> -				      ~__GFP_DIRECT_RECLAIM, order,
> -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
>  	if (page)
>  		goto out;
>  
> @@ -3755,6 +3750,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return !!__gfp_pfmemalloc_flags(gfp_mask);
>  }
>  
> +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> +					       unsigned int order,
> +					       const struct alloc_context *ac)
> +{
> +	/*
> +	 * Go through the zonelist yet one more time, keep very high watermark
> +	 * here, this is only to catch a parallel oom killing, we must fail if
> +	 * we're still under heavy pressure. But make sure that this reclaim
> +	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> +	 * allocation which will never fail due to oom_lock already held.
> +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
> +	 * in case they haven't tried ALLOC_OOM watermark.
> +	 */
> +	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> +	int reserve_flags;
> +
> +	gfp_mask |= __GFP_HARDWALL;
> +	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserve_flags)
> +		alloc_flags = reserve_flags;
> +	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +}
> +
>  /*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-11 11:42         ` [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Michal Hocko
@ 2017-12-12  8:09           ` Tetsuo Handa
  2017-12-12 10:07             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-12  8:09 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, akpm, linux-mm, aarcange, rientjes, mjaggi, oleg, vdavydov.dev

Michal Hocko wrote:
> That being said, I will keep refusing other such tweaks unless you have
> a sound usecase behind. If you really _want_ to help out here then you
> can focus on the reaping of the mlock memory.

Not the reaping of the mlock'ed memory. Although Manish's report was mlock'ed
case, there are other cases (e.g. MAP_SHARED, mmu_notifier, mmap_sem held for
write) which can lead to this race condition. If we think about artificial case,
it would be possible to run 1024 threads not sharing signal_struct but consume
almost 0KB memory (i.e. written without using C library) and many of them are
running between __gfp_pfmemalloc_flags() and mutex_trylock() waiting for
ALLOC_OOM.

What the Manish's report revealed is the fact that we accidentally broke the

	/*
	 * Kill all user processes sharing victim->mm in other thread groups, if
	 * any.  They don't get access to memory reserves, though, to avoid
	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
	 * oom killed thread cannot exit because it requires the semaphore and
	 * its contended by another thread trying to allocate memory itself.
	 * That thread will now get access to memory reserves since it has a
	 * pending fatal signal.
	 */

assumption via 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks"), and we already wasted for 16 months. There is no need to
wait for fixing mlock'ed, MAP_SHARED, mmu_notifier and mmap_sem cases because
"OOM victims consuming almost 0KB memory" case cannot be solved.

The mlock'ed, MAP_SHARED, mmu_notifier and mmap_sem cases are a sort of alias
of "OOM victims consuming almost 0KB memory" case.

Anyway, since you introduced MMF_OOM_VICTIM flag, I will try a patch which
checks MMF_OOM_VICTIM instead of oom_reserves_allowed().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-12  8:09           ` Tetsuo Handa
@ 2017-12-12 10:07             ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-12-12 10:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, akpm, linux-mm, aarcange, rientjes, mjaggi, oleg, vdavydov.dev

On Tue 12-12-17 17:09:36, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > That being said, I will keep refusing other such tweaks unless you have
> > a sound usecase behind. If you really _want_ to help out here then you
> > can focus on the reaping of the mlock memory.
> 
> Not the reaping of the mlock'ed memory. Although Manish's report was mlock'ed
> case, there are other cases (e.g. MAP_SHARED, mmu_notifier, mmap_sem held for
> write) which can lead to this race condition.

Could you actually start thinking in a bigger picture rather than
obsessively check the code? If MAP_SHARED is file backed then it should
be reclaimable. If it is memory backed then we are screwed with
insufficient sized swap partition. I've seen that David is already
looking at the mmu_notifier. Well and the mmap_sem, sure that can happen
but as long we do not see excessive OOM events out there I would rather
leave it alone.

> If we think about artificial case,
> it would be possible to run 1024 threads not sharing signal_struct but consume
> almost 0KB memory (i.e. written without using C library) and many of them are
> running between __gfp_pfmemalloc_flags() and mutex_trylock() waiting for
> ALLOC_OOM.

Sigh... Nobody is arguing that the race is impossible. Just read what
I've wrote. I recognize the race but I am not willing to add kludges
into an already complicated code if it doesn't matter _practically_.
A malicious user can DOS the system by other means and you have to
configure your system carefully to prevent from that.

So all I care about is to see whether these races happen out there in
natural workloads and then we can more heuristics.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-11 11:57 ` Michal Hocko
@ 2017-12-13 11:06   ` Tetsuo Handa
  2017-12-19 14:36     ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-13 11:06 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

Michal Hocko wrote:
> > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> > last second allocation attempt.
> 
> This changelog doesn't explain the problem, nor does it say why it
> should help. I would even argue that mentioning the LTP test is more
> confusing than helpful (also considering it a fix for 696453e66630ad45)
> because granting access to memory reserves will only help partially.

I know granting access to memory reserves will only help partially.
The intent of granting access to memory reserves is to reduce needlessly
OOM killing more victims.

> Anyway, the patch makes some sense to me but I am not going to ack it
> with a misleading changelog.
> 

Apart from how the changelog will look like, below is an updated patch
which to some degree recovers

	 * That thread will now get access to memory reserves since it has a
	 * pending fatal signal.

comment. It is pity that we will need to run more instructions in the fastpath
of __alloc_pages_slowpath() compared to "current->oom_kill_free_check_raced"
at out_of_memory(). Is this direction acceptable?

---
 mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 31c1a61..f7bd969 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3334,6 +3334,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	return page;
 }
 
+static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
+					       unsigned int order,
+					       const struct alloc_context *ac);
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
@@ -3359,16 +3363,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		return NULL;
 	}
 
-	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
-	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
-				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
 	if (page)
 		goto out;
 
@@ -3734,9 +3729,17 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 	return alloc_flags;
 }
 
-static bool oom_reserves_allowed(struct task_struct *tsk)
+static bool oom_reserves_allowed(void)
 {
-	if (!tsk_is_oom_victim(tsk))
+	struct mm_struct *mm = current->mm;
+
+	if (!mm)
+		mm = current->signal->oom_mm;
+	/* MMF_OOM_VICTIM not set on mm means that I am not an OOM victim. */
+	if (!mm || !test_bit(MMF_OOM_VICTIM, &mm->flags))
+		return false;
+	/* MMF_OOM_VICTIM can be set on mm used by the global init process. */
+	if (!fatal_signal_pending(current) && !(current->flags & PF_EXITING))
 		return false;
 
 	/*
@@ -3764,7 +3767,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
 	if (!in_interrupt()) {
 		if (current->flags & PF_MEMALLOC)
 			return ALLOC_NO_WATERMARKS;
-		else if (oom_reserves_allowed(current))
+		else if (oom_reserves_allowed())
 			return ALLOC_OOM;
 	}
 
@@ -3776,6 +3779,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
+static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
+					       unsigned int order,
+					       const struct alloc_context *ac)
+{
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure. But make sure that this reclaim
+	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
+	 * allocation which will never fail due to oom_lock already held.
+	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
+	 * in case they haven't tried ALLOC_OOM watermark.
+	 */
+	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
+	int reserve_flags;
+
+	gfp_mask |= __GFP_HARDWALL;
+	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
+	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+}
+
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-13 11:06   ` Tetsuo Handa
@ 2017-12-19 14:36     ` Tetsuo Handa
  2017-12-19 14:55       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-19 14:36 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> > > last second allocation attempt.
> > 
> > This changelog doesn't explain the problem, nor does it say why it
> > should help. I would even argue that mentioning the LTP test is more
> > confusing than helpful (also considering it a fix for 696453e66630ad45)
> > because granting access to memory reserves will only help partially.
> 
> I know granting access to memory reserves will only help partially.
> The intent of granting access to memory reserves is to reduce needlessly
> OOM killing more victims.
> 
> > Anyway, the patch makes some sense to me but I am not going to ack it
> > with a misleading changelog.
> > 
> 
> Apart from how the changelog will look like, below is an updated patch
> which to some degree recovers
> 
> 	 * That thread will now get access to memory reserves since it has a
> 	 * pending fatal signal.
> 
> comment. It is pity that we will need to run more instructions in the fastpath
> of __alloc_pages_slowpath() compared to "current->oom_kill_free_check_raced"
> at out_of_memory(). Is this direction acceptable?

If http://lkml.kernel.org/r/20171219114012.GK2787@dhcp22.suse.cz ,
is direction below acceptable?

> 
> ---
>  mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 31c1a61..f7bd969 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3334,6 +3334,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	return page;
>  }
>  
> +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> +					       unsigned int order,
> +					       const struct alloc_context *ac);
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
> @@ -3359,16 +3363,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  		return NULL;
>  	}
>  
> -	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure. But make sure that this reclaim
> -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> -	 * allocation which will never fail due to oom_lock already held.
> -	 */
> -	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> -				      ~__GFP_DIRECT_RECLAIM, order,
> -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
>  	if (page)
>  		goto out;
>  
> @@ -3734,9 +3729,17 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
>  	return alloc_flags;
>  }
>  
> -static bool oom_reserves_allowed(struct task_struct *tsk)
> +static bool oom_reserves_allowed(void)
>  {
> -	if (!tsk_is_oom_victim(tsk))
> +	struct mm_struct *mm = current->mm;
> +
> +	if (!mm)
> +		mm = current->signal->oom_mm;
> +	/* MMF_OOM_VICTIM not set on mm means that I am not an OOM victim. */
> +	if (!mm || !test_bit(MMF_OOM_VICTIM, &mm->flags))
> +		return false;
> +	/* MMF_OOM_VICTIM can be set on mm used by the global init process. */
> +	if (!fatal_signal_pending(current) && !(current->flags & PF_EXITING))
>  		return false;
>  
>  	/*
> @@ -3764,7 +3767,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
>  	if (!in_interrupt()) {
>  		if (current->flags & PF_MEMALLOC)
>  			return ALLOC_NO_WATERMARKS;
> -		else if (oom_reserves_allowed(current))
> +		else if (oom_reserves_allowed())
>  			return ALLOC_OOM;
>  	}
>  
> @@ -3776,6 +3779,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return !!__gfp_pfmemalloc_flags(gfp_mask);
>  }
>  
> +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> +					       unsigned int order,
> +					       const struct alloc_context *ac)
> +{
> +	/*
> +	 * Go through the zonelist yet one more time, keep very high watermark
> +	 * here, this is only to catch a parallel oom killing, we must fail if
> +	 * we're still under heavy pressure. But make sure that this reclaim
> +	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> +	 * allocation which will never fail due to oom_lock already held.
> +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
> +	 * in case they haven't tried ALLOC_OOM watermark.
> +	 */
> +	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> +	int reserve_flags;
> +
> +	gfp_mask |= __GFP_HARDWALL;
> +	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +	if (reserve_flags)
> +		alloc_flags = reserve_flags;
> +	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +}
> +
>  /*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
> -- 
> 1.8.3.1
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-19 14:36     ` Tetsuo Handa
@ 2017-12-19 14:55       ` Michal Hocko
  2017-12-21 15:34         ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-19 14:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

On Tue 19-12-17 23:36:02, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > Therefore, this patch allows OOM victims to use ALLOC_OOM watermark for
> > > > last second allocation attempt.
> > > 
> > > This changelog doesn't explain the problem, nor does it say why it
> > > should help. I would even argue that mentioning the LTP test is more
> > > confusing than helpful (also considering it a fix for 696453e66630ad45)
> > > because granting access to memory reserves will only help partially.
> > 
> > I know granting access to memory reserves will only help partially.
> > The intent of granting access to memory reserves is to reduce needlessly
> > OOM killing more victims.
> > 
> > > Anyway, the patch makes some sense to me but I am not going to ack it
> > > with a misleading changelog.
> > > 
> > 
> > Apart from how the changelog will look like, below is an updated patch
> > which to some degree recovers
> > 
> > 	 * That thread will now get access to memory reserves since it has a
> > 	 * pending fatal signal.
> > 
> > comment. It is pity that we will need to run more instructions in the fastpath
> > of __alloc_pages_slowpath() compared to "current->oom_kill_free_check_raced"
> > at out_of_memory(). Is this direction acceptable?
> 
> If http://lkml.kernel.org/r/20171219114012.GK2787@dhcp22.suse.cz ,
> is direction below acceptable?

The same applies here. You are touching the way how the memory reserves
are access in non-trivial way. You better have a very good reason for
that. So far you keep playing with different corner cases while you keep
showing that you do not really want to understand a bigger picture. This
can end up in regressions easily. Let me repeat something I've said a
long ago. We do not optimize for corner cases. We want to survive but if
an alternative is to kill another task then we can live with that.
 
> > ---
> >  mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 40 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 31c1a61..f7bd969 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3334,6 +3334,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> >  	return page;
> >  }
> >  
> > +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> > +					       unsigned int order,
> > +					       const struct alloc_context *ac);
> > +
> >  static inline struct page *
> >  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  	const struct alloc_context *ac, unsigned long *did_some_progress)
> > @@ -3359,16 +3363,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> >  		return NULL;
> >  	}
> >  
> > -	/*
> > -	 * Go through the zonelist yet one more time, keep very high watermark
> > -	 * here, this is only to catch a parallel oom killing, we must fail if
> > -	 * we're still under heavy pressure. But make sure that this reclaim
> > -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > -	 * allocation which will never fail due to oom_lock already held.
> > -	 */
> > -	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> > -				      ~__GFP_DIRECT_RECLAIM, order,
> > -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> > +	page = alloc_pages_before_oomkill(gfp_mask, order, ac);
> >  	if (page)
> >  		goto out;
> >  
> > @@ -3734,9 +3729,17 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
> >  	return alloc_flags;
> >  }
> >  
> > -static bool oom_reserves_allowed(struct task_struct *tsk)
> > +static bool oom_reserves_allowed(void)
> >  {
> > -	if (!tsk_is_oom_victim(tsk))
> > +	struct mm_struct *mm = current->mm;
> > +
> > +	if (!mm)
> > +		mm = current->signal->oom_mm;
> > +	/* MMF_OOM_VICTIM not set on mm means that I am not an OOM victim. */
> > +	if (!mm || !test_bit(MMF_OOM_VICTIM, &mm->flags))
> > +		return false;
> > +	/* MMF_OOM_VICTIM can be set on mm used by the global init process. */
> > +	if (!fatal_signal_pending(current) && !(current->flags & PF_EXITING))
> >  		return false;
> >  
> >  	/*
> > @@ -3764,7 +3767,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
> >  	if (!in_interrupt()) {
> >  		if (current->flags & PF_MEMALLOC)
> >  			return ALLOC_NO_WATERMARKS;
> > -		else if (oom_reserves_allowed(current))
> > +		else if (oom_reserves_allowed())
> >  			return ALLOC_OOM;
> >  	}
> >  
> > @@ -3776,6 +3779,30 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  	return !!__gfp_pfmemalloc_flags(gfp_mask);
> >  }
> >  
> > +static struct page *alloc_pages_before_oomkill(gfp_t gfp_mask,
> > +					       unsigned int order,
> > +					       const struct alloc_context *ac)
> > +{
> > +	/*
> > +	 * Go through the zonelist yet one more time, keep very high watermark
> > +	 * here, this is only to catch a parallel oom killing, we must fail if
> > +	 * we're still under heavy pressure. But make sure that this reclaim
> > +	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> > +	 * allocation which will never fail due to oom_lock already held.
> > +	 * Also, make sure that OOM victims can try ALLOC_OOM watermark
> > +	 * in case they haven't tried ALLOC_OOM watermark.
> > +	 */
> > +	int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> > +	int reserve_flags;
> > +
> > +	gfp_mask |= __GFP_HARDWALL;
> > +	gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> > +	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> > +	if (reserve_flags)
> > +		alloc_flags = reserve_flags;
> > +	return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > +}
> > +
> >  /*
> >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> >   * for the given allocation request.
> > -- 
> > 1.8.3.1
> > 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-19 14:55       ` Michal Hocko
@ 2017-12-21 15:34         ` Tetsuo Handa
  2017-12-21 16:42           ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-21 15:34 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

Michal Hocko wrote:
> On Tue 19-12-17 23:36:02, Tetsuo Handa wrote:
> > If http://lkml.kernel.org/r/20171219114012.GK2787@dhcp22.suse.cz ,
> > is direction below acceptable?
> 
> The same applies here. You are touching the way how the memory reserves
> are access in non-trivial way. You better have a very good reason for
> that. So far you keep playing with different corner cases while you keep
> showing that you do not really want to understand a bigger picture. This
> can end up in regressions easily.

Any OOM-killed thread is allowed to use memory reserves up to ALLOC_OOM
watermark. How can allowing all OOM-killed threads to try ALLOC_OOM
watermark cause regressions?

Commit cd04ae1e2dc8e365 ("mm, oom: do not rely on TIF_MEMDIE for memory
reserves access") changed from only TIF_MEMDIE thread to all threads in
one thread group. But we don't call it a regression.

My proposal is nothing but changes from all threads in one thread group to
all threads in all thread groups (which were killed due to sharing the
victim's mm). And how can we call it a regression?

>                                   Let me repeat something I've said a
> long ago. We do not optimize for corner cases. We want to survive but if
> an alternative is to kill another task then we can live with that.
>  

Setting MMF_OOM_SKIP before all OOM-killed threads try memory reserves
leads to needlessly selecting more OOM victims.

Unless any OOM-killed thread fails to satisfy allocation even with ALLOC_OOM,
no OOM-killed thread needs to select more OOM victims. Commit 696453e66630ad45
("mm, oom: task_will_free_mem should skip oom_reaped tasks") obviously broke
it, which is exactly a regression.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-21 15:34         ` Tetsuo Handa
@ 2017-12-21 16:42           ` Michal Hocko
  2017-12-23 14:41             ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-21 16:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg, vdavydov.dev

On Fri 22-12-17 00:34:05, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> >                                   Let me repeat something I've said a
> > long ago. We do not optimize for corner cases. We want to survive but if
> > an alternative is to kill another task then we can live with that.
> >  
> 
> Setting MMF_OOM_SKIP before all OOM-killed threads try memory reserves
> leads to needlessly selecting more OOM victims.
> 
> Unless any OOM-killed thread fails to satisfy allocation even with ALLOC_OOM,
> no OOM-killed thread needs to select more OOM victims. Commit 696453e66630ad45
> ("mm, oom: task_will_free_mem should skip oom_reaped tasks") obviously broke
> it, which is exactly a regression.

You are trying to fix a completely artificial case. Or do you have any
example of an application which uses CLONE_VM without sharing signals?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation
  2017-12-21 16:42           ` Michal Hocko
@ 2017-12-23 14:41             ` Tetsuo Handa
  0 siblings, 0 replies; 17+ messages in thread
From: Tetsuo Handa @ 2017-12-23 14:41 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, aarcange, rientjes, hannes, mjaggi, oleg,
	vdavydov.dev, torvalds

Michal Hocko wrote:
> On Fri 22-12-17 00:34:05, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > >                                   Let me repeat something I've said a
> > > long ago. We do not optimize for corner cases. We want to survive but if
> > > an alternative is to kill another task then we can live with that.
> > >  
> > 
> > Setting MMF_OOM_SKIP before all OOM-killed threads try memory reserves
> > leads to needlessly selecting more OOM victims.
> > 
> > Unless any OOM-killed thread fails to satisfy allocation even with ALLOC_OOM,
> > no OOM-killed thread needs to select more OOM victims. Commit 696453e66630ad45
> > ("mm, oom: task_will_free_mem should skip oom_reaped tasks") obviously broke
> > it, which is exactly a regression.
> 
> You are trying to fix a completely artificial case. Or do you have any
> example of an application which uses CLONE_VM without sharing signals?

Your response is an invalid and insane resistance.

You dare to silently made user visible changes. If you really believe that
there is no application which uses CLONE_VM without sharing signals, let's
revert below patch.

----------
commit 44a70adec910d6929689e42b6e5cee5b7d202d20
Author: Michal Hocko <mhocko@suse.com>
Date:   Thu Jul 28 15:44:43 2016 -0700

    mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj

    oom_score_adj is shared for the thread groups (via struct signal) but this
    is not sufficient to cover processes sharing mm (CLONE_VM without
    CLONE_SIGHAND) and so we can easily end up in a situation when some
    processes update their oom_score_adj and confuse the oom killer.  In the
    worst case some of those processes might hide from the oom killer
    altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
    would then pick up those eligible but won't be allowed to kill others
    sharing the same mm so the mm wouldn't release the mm and so the memory.

    It would be ideal to have the oom_score_adj per mm_struct because that is
    the natural entity OOM killer considers.  But this will not work because
    some programs are doing

        vfork()
        set_oom_adj()
        exec()

    We can achieve the same though.  oom_score_adj write handler can set the
    oom_score_adj for all processes sharing the same mm if the task is not in
    the middle of vfork.  As a result all the processes will share the same
    oom_score_adj.  The current implementation is rather pessimistic and
    checks all the existing processes by default if there is more than 1
    holder of the mm but we do not have any reliable way to check for external
    users yet.

    Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
----------

We do prepare for the worst case. We have just experienced two mistakes
by not considering the worst case.

----------
commit 400e22499dd92613821374c8c6c88c7225359980
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Wed Nov 15 17:38:37 2017 -0800

    mm: don't warn about allocations which stall for too long

    Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") was a great step for reducing possibility of silent hang up
    problem caused by memory allocation stalls.  But this commit reverts it,
    for it is possible to trigger OOM lockup and/or soft lockups when many
    threads concurrently called warn_alloc() (in order to warn about memory
    allocation stalls) due to current implementation of printk(), and it is
    difficult to obtain useful information due to limitation of synchronous
    warning approach.

    Current printk() implementation flushes all pending logs using the
    context of a thread which called console_unlock().  printk() should be
    able to flush all pending logs eventually unless somebody continues
    appending to printk() buffer.

    Since warn_alloc() started appending to printk() buffer while waiting
    for oom_kill_process() to make forward progress when oom_kill_process()
    is processing pending logs, it became possible for warn_alloc() to force
    oom_kill_process() loop inside printk().  As a result, warn_alloc()
    significantly increased possibility of preventing oom_kill_process()
    from making forward progress.

    ---------- Pseudo code start ----------
    Before warn_alloc() was introduced:

      retry:
        if (mutex_trylock(&oom_lock)) {
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_lock)
        }
        goto retry;

    After warn_alloc() was introduced:

      retry:
        if (mutex_trylock(&oom_lock)) {
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_lock)
        } else if (waited_for_10seconds()) {
          atomic_inc(&printk_pending_logs);
        }
        goto retry;
    ---------- Pseudo code end ----------

    Although waited_for_10seconds() becomes true once per 10 seconds,
    unbounded number of threads can call waited_for_10seconds() at the same
    time.  Also, since threads doing waited_for_10seconds() keep doing
    almost busy loop, the thread doing print_one_log() can use little CPU
    resource.  Therefore, this situation can be simplified like

    ---------- Pseudo code start ----------
      retry:
        if (mutex_trylock(&oom_lock)) {
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_lock)
        } else {
          atomic_inc(&printk_pending_logs);
        }
        goto retry;
    ---------- Pseudo code end ----------

    when printk() is called faster than print_one_log() can process a log.

    One of possible mitigation would be to introduce a new lock in order to
    make sure that no other series of printk() (either oom_kill_process() or
    warn_alloc()) can append to printk() buffer when one series of printk()
    (either oom_kill_process() or warn_alloc()) is already in progress.

    Such serialization will also help obtaining kernel messages in readable
    form.

    ---------- Pseudo code start ----------
      retry:
        if (mutex_trylock(&oom_lock)) {
          mutex_lock(&oom_printk_lock);
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_printk_lock);
          mutex_unlock(&oom_lock)
        } else {
          if (mutex_trylock(&oom_printk_lock)) {
            atomic_inc(&printk_pending_logs);
            mutex_unlock(&oom_printk_lock);
          }
        }
        goto retry;
    ---------- Pseudo code end ----------

    But this commit does not go that direction, for we don't want to
    introduce a new lock dependency, and we unlikely be able to obtain
    useful information even if we serialized oom_kill_process() and
    warn_alloc().

    Synchronous approach is prone to unexpected results (e.g.  too late [1],
    too frequent [2], overlooked [3]).  As far as I know, warn_alloc() never
    helped with providing information other than "something is going wrong".
    I want to consider asynchronous approach which can obtain information
    during stalls with possibly relevant threads (e.g.  the owner of
    oom_lock and kswapd-like threads) and serve as a trigger for actions
    (e.g.  turn on/off tracepoints, ask libvirt daemon to take a memory dump
    of stalling KVM guest for diagnostic purpose).

    This commit temporarily loses ability to report e.g.  OOM lockup due to
    unable to invoke the OOM killer due to !__GFP_FS allocation request.
    But asynchronous approach will be able to detect such situation and emit
    warning.  Thus, let's remove warn_alloc().

    [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
    [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
    [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))

    Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
    Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
    Reported-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
----------

----------
commit 4837fe37adff1d159904f0c013471b1ecbcb455e
Author: Michal Hocko <mhocko@suse.com>
Date:   Thu Dec 14 15:33:15 2017 -0800

    mm, oom_reaper: fix memory corruption

    David Rientjes has reported the following memory corruption while the
    oom reaper tries to unmap the victims address space

      BUG: Bad page map in process oom_reaper  pte:6353826300000000 pmd:00000000
      addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping:          (null) index:7f50cab1d
      file:          (null) fault:          (null) mmap:          (null) readpage:          (null)
      CPU: 2 PID: 1001 Comm: oom_reaper
      Call Trace:
         unmap_page_range+0x1068/0x1130
         __oom_reap_task_mm+0xd5/0x16b
         oom_reaper+0xff/0x14c
         kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap is
    insufficient.  We only synchronize with the oom reaper if
    tsk_is_oom_victim which is not true if the final __mmput is called from
    a different context than the oom victim exit path.  This can trivially
    happen from context of any task which has grabbed mm reference (e.g.  to
    read /proc/<pid>/ file which requires mm etc.).

    The race would look like this

      oom_reaper                oom_victim              task
                                                mmget_not_zero
                        do_exit
                          mmput
      __oom_reap_task_mm                                mmput
                                                  __mmput
                                                    exit_mmap
                                                      remove_vma
        unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task.  Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim.  The flag is set in mark_oom_victim and never cleared
    so it is stable in the exit_mmap path.

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reported-by: David Rientjes <rientjes@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Andrea Argangeli <andrea@kernel.org>
    Cc: <stable@vger.kernel.org>        [4.14]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
----------

There is no reason not to make sure processes sharing victim's mm have
same treatment regarding use of memory reserves.

You want to get rid of TIF_MEMDIE flag, but you rejected my patch which is
one of steps for getting rid of TIF_MEMDIE flag.

Whether a problem is seen in real life is a catch-22 discussion, for you
keep refusing/ignoring to provide a method for allowing normal users to
tell whether they hit that problem.

Please stop rejecting my patches without thinking deeply.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-12-23 15:34 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-07 11:42 [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Tetsuo Handa
2017-12-07 11:51 ` Michal Hocko
2017-12-07 11:59   ` Tetsuo Handa
2017-12-07 12:22     ` Michal Hocko
2017-12-08 10:58       ` Tetsuo Handa
2017-12-11 11:15         ` [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP unless __GFP_NOFAIL Tetsuo Handa
2017-12-11 11:44           ` Michal Hocko
2017-12-11 11:42         ` [PATCH] mm,oom: use ALLOC_OOM for OOM victim's last second allocation Michal Hocko
2017-12-12  8:09           ` Tetsuo Handa
2017-12-12 10:07             ` Michal Hocko
2017-12-11 11:57 ` Michal Hocko
2017-12-13 11:06   ` Tetsuo Handa
2017-12-19 14:36     ` Tetsuo Handa
2017-12-19 14:55       ` Michal Hocko
2017-12-21 15:34         ` Tetsuo Handa
2017-12-21 16:42           ` Michal Hocko
2017-12-23 14:41             ` Tetsuo Handa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.