From: Michal Hocko <mhocko@kernel.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 1/2] mm, oom: do not rely on TIF_MEMDIE for memory reserves access
Date: Thu, 3 Aug 2017 13:00:30 +0200	[thread overview]
Message-ID: <20170803110030.GJ12521@dhcp22.suse.cz> (raw)
In-Reply-To: <20170803093701.icju4mxmto3ls3ch@techsingularity.net>

On Thu 03-08-17 10:37:01, Mel Gorman wrote:
> On Wed, Aug 02, 2017 at 10:29:14AM +0200, Michal Hocko wrote:
> >     For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
> >     victims and then, among other things, to give these threads full
> >     access to memory reserves. There are a few shortcomings of this
> >     implementation, though.
> >     
> 
> I believe the original intent was that allocations from the exit path
> would be small and relatively short-lived.

Yes.

> 
> >     First of all, and most seriously, the full access to memory reserves
> >     is quite dangerous because it leaves no safety room for the system to
> >     operate and potentially take last emergency steps to move on.
> >     
> >     Secondly, this flag is per task_struct while the OOM killer operates
> >     on mm_struct granularity so all processes sharing the given mm are
> >     killed. Giving full access to all these task_structs could lead to
> >     a quick depletion of memory reserves.
> 
> This is a valid concern.
> 
> >     We have tried to reduce this risk by
> >     giving TIF_MEMDIE only to the main thread and the currently allocating
> >     task but that doesn't really solve this problem while it surely opens up
> >     room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
> >     the allocator without access to memory reserves because a particular
> >     thread was not the group leader.
> >     
> >     Now that we have the oom reaper and that all oom victims are reapable
> >     after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
> >     kthreads") we can be more conservative and grant only partial access to
> >     memory reserves because there are reasonable chances of parallel
> >     memory freeing. We still want some access to reserves because we do not
> >     want other consumers to eat up the victim's freed memory. oom victims
> >     will still contend with __GFP_HIGH users but those shouldn't be so
> >     aggressive to starve oom victims completely.
> >     
> >     Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
> >     to half of the reserves. This makes the access to reserves independent
> >     of which task has passed through mark_oom_victim. Also drop any
> >     usage of TIF_MEMDIE from the page allocator proper and replace it by
> >     tsk_is_oom_victim as well, which will finally make page_alloc.c
> >     completely TIF_MEMDIE free.
> >     
> >     CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
> >     ALLOC_NO_WATERMARKS approach.
> >     
> >     There is a demand to make the oom killer memcg aware which will imply
> >     many tasks killed at once. This change will allow such a usecase without
> >     worrying about complete memory reserves depletion.
> >     
> >     Changes since v1
> >     - do not play tricks with nommu and grant access to memory reserves as
> >       long as TIF_MEMDIE is set
> >     - break out from allocation properly for oom victims as per Tetsuo
> >     - distinguish oom victims from other consumers of memory reserves in
> >       __gfp_pfmemalloc_flags - per Tetsuo
> >     
> >     Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 24d88f084705..1ebcb1ed01b5 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> >  /* Mask to get the watermark bits */
> >  #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
> >  
> > +/*
> > + * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
> > + * cannot assume a reduced access to memory reserves is sufficient for
> > + * !MMU
> > + */
> > +#ifdef CONFIG_MMU
> > +#define ALLOC_OOM		0x08
> > +#else
> > +#define ALLOC_OOM		ALLOC_NO_WATERMARKS
> > +#endif
> > +
> >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> >  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> 
> Ok, no collision with the wmark indexes so that should be fine. While I
> didn't check, I suspect that !MMU users also have too few CPUs for major
> contention to be a concern.

Well, I didn't try to improve the !MMU case because a) I do not know
whether there is a real problem with oom depletion there and b) I have
no way to test this. So I only focused on keeping the status quo for
nommu.
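
As a side note on Mel's point about the wmark indexes: there is no
collision because the watermark index lives in the low two bits while
ALLOC_OOM takes the free bit between ALLOC_NO_WATERMARKS and
ALLOC_HARDER. A rough sketch of the resulting flag space (illustrative
values, not verbatim mm/internal.h):

	/* low bits select the watermark index via ALLOC_WMARK_MASK (0x03) */
	#define ALLOC_WMARK_MIN		0x00	/* use the min watermark */
	#define ALLOC_WMARK_LOW		0x01	/* use the low watermark */
	#define ALLOC_WMARK_HIGH	0x02	/* use the high watermark */
	#define ALLOC_NO_WATERMARKS	0x04	/* do not check watermarks at all */
	#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)	/* == 0x03 */
	#define ALLOC_OOM		0x08	/* fits below ALLOC_HARDER (0x10) */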

[...]

> > @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> >  	 * the high-atomic reserves. This will over-estimate the size of the
> >  	 * atomic reserve but it avoids a search.
> >  	 */
> > -	if (likely(!alloc_harder))
> > +	if (likely(!alloc_harder)) {
> >  		free_pages -= z->nr_reserved_highatomic;
> > -	else
> > -		min -= min / 4;
> > +	} else {
> > +		/*
> > +		 * OOM victims can try even harder than normal ALLOC_HARDER
> > +		 * users
> > +		 */
> > +		if (alloc_flags & ALLOC_OOM)
> > +			min -= min / 2;
> > +		else
> > +			min -= min / 4;
> > +	}
> > +
> >  
> >  #ifdef CONFIG_CMA
> >  	/* If allocation can't use CMA areas don't use free CMA pages */
> 
> I see no problem here although the comment could be slightly better. It
> suggests that OOM victims can try harder but not why:
> 
> /*
>  * OOM victims can try even harder than normal ALLOC_HARDER users on the
>  * grounds that it's definitely going to be in the exit path shortly and
>  * free memory. Any allocation it makes during the free path will be
>  * small and short-lived.
>  */

OK, I have replaced the comment.
 
> > @@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	return alloc_flags;
> >  }
> >  
> > -bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > +static bool oom_reserves_allowed(struct task_struct *tsk)
> >  {
> > -	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> > +	if (!tsk_is_oom_victim(tsk))
> > +		return false;
> > +
> > +	/*
> > +	 * !MMU doesn't have oom reaper so give access to memory reserves
> > +	 * only to the thread with TIF_MEMDIE set
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
> >  		return false;
> >  
> > +	return true;
> > +}
> > +
> 
> Ok, there is a chance that a task selected as an OOM kill victim may be
> in the middle of a __GFP_NOMEMALLOC allocation but I can't actually see a
> problem with that. __GFP_NOMEMALLOC users are not going to be in the exit
> path (which we care about for an OOM killed task) and the caller should
> always be able to handle a failure.

Not sure I understand. If the oom victim is doing __GFP_NOMEMALLOC then
we haven't been doing ALLOC_NO_WATERMARKS even before, so I preserve that
behavior here - even though I am not sure whether it is deliberate or
more a result of how the code has evolved.

> > +/*
> > + * Distinguish requests which really need access to full memory
> > + * reserves from oom victims which can live with a portion of it
> > + */
> > +static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
> > +{
> > +	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> > +		return 0;
> >  	if (gfp_mask & __GFP_MEMALLOC)
> > -		return true;
> > +		return ALLOC_NO_WATERMARKS;
> >  	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> > -		return true;
> > -	if (!in_interrupt() &&
> > -			((current->flags & PF_MEMALLOC) ||
> > -			 unlikely(test_thread_flag(TIF_MEMDIE))))
> > -		return true;
> > +		return ALLOC_NO_WATERMARKS;
> > +	if (!in_interrupt()) {
> > +		if (current->flags & PF_MEMALLOC)
> > +			return ALLOC_NO_WATERMARKS;
> > +		else if (oom_reserves_allowed(current))
> > +			return ALLOC_OOM;
> > +	}
> >  
> > -	return false;
> > +	return 0;
> > +}
> > +
> > +bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > +{
> > +	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
> >  }
> 
> Very subtle signedness error here. If the flags ever use the high bit,
> this wraps and fails. It "shouldn't be possible" but you could just remove
> the "> 0" there to be on the safe side or have __gfp_pfmemalloc_flags
> return unsigned.

what about
	return !!__gfp_pfmemalloc_flags(gfp_mask);
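
Just to illustrate the signedness concern with a contrived userspace
sketch (hypothetical flag value, not kernel code): a flag occupying the
sign bit would make the signed "> 0" test fail while "!!" keeps working:

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		/* stand-in for a hypothetical ALLOC_* flag using the sign bit */
		int flags = INT_MIN;

		printf("flags > 0 : %d\n", flags > 0);	/* 0 - check wrongly fails */
		printf("!!flags   : %d\n", !!flags);	/* 1 - flag still detected */
		return 0;
	}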
 
> >  /*
> > @@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	unsigned long alloc_start = jiffies;
> >  	unsigned int stall_timeout = 10 * HZ;
> >  	unsigned int cpuset_mems_cookie;
> > +	int reserves;
> >  
> 
> This should be explicitly named to indicate it's about flags and not the
> number of reserve pages or something else wacky.

s@reserves@reserve_flags@?

> >  	/*
> >  	 * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> >  		wake_all_kswapds(order, ac);
> >  
> > -	if (gfp_pfmemalloc_allowed(gfp_mask))
> > -		alloc_flags = ALLOC_NO_WATERMARKS;
> > +	reserves = __gfp_pfmemalloc_flags(gfp_mask);
> > +	if (reserves)
> > +		alloc_flags = reserves;
> >  
> 
> And if it's reserve_flags you can save a branch with
> 
> reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> alloc_flags |= reserve_flags;
> 
> It won't make much difference considering how branch-intensive the allocator
> is anyway.

I was actually considering that but decided against it because I wanted
to reset alloc_flags rather than create strange ALLOC_$FOO combinations
which would be harder to evaluate.
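
To spell out the two variants being weighed here (a compile-only sketch
with hypothetical helper names, not the actual slowpath):

	/*
	 * Variant used in the patch: reset, so alloc_flags becomes exactly
	 * the reserve policy and only a few known values reach the
	 * watermark checks.
	 */
	static unsigned int flags_reset(unsigned int alloc_flags,
					unsigned int reserve_flags)
	{
		if (reserve_flags)
			alloc_flags = reserve_flags;
		return alloc_flags;
	}

	/*
	 * Mel's variant: OR, which drops the branch but lets reserve bits
	 * combine with whatever gfp_to_alloc_flags() already set,
	 * e.g. ALLOC_CPUSET|ALLOC_OOM.
	 */
	static unsigned int flags_or(unsigned int alloc_flags,
				     unsigned int reserve_flags)
	{
		return alloc_flags | reserve_flags;
	}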
 
> >  	/*
> >  	 * Reset the zonelist iterators if memory policies can be ignored.
> >  	 * These allocations are high priority and system rather than user
> >  	 * orientated.
> >  	 */
> > -	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
> > +	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
> >  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> >  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> >  					ac->high_zoneidx, ac->nodemask);
> > @@ -3960,8 +3996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto got_pg;
> >  
> >  	/* Avoid allocations with no watermarks from looping endlessly */
> > -	if (test_thread_flag(TIF_MEMDIE) &&
> > -	    (alloc_flags == ALLOC_NO_WATERMARKS ||
> > +	if (tsk_is_oom_victim(current) &&
> > +	    (alloc_flags == ALLOC_OOM ||
> >  	     (gfp_mask & __GFP_NOMEMALLOC)))
> >  		goto nopage;
> >  
> 
> Mostly I only found nit-picks so whether you address them or not
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks a lot for your review. Here is an incremental diff on top
---
commit 8483306e108a25441d796a8da3df5b85d633d859
Author: Michal Hocko <mhocko@suse.com>
Date:   Thu Aug 3 12:58:49 2017 +0200

    fold me "mm, oom: do not rely on TIF_MEMDIE for memory reserves access"
    
    - clarify access to memory reserves in __zone_watermark_ok - per Mel
    - make int->bool conversion in gfp_pfmemalloc_allowed more robust - per
      Mel
    - s@reserves@reserve_flags@ in __alloc_pages_slowpath - per Mel

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ae0f6d45614..90e331e4c077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2948,7 +2948,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	} else {
 		/*
 		 * OOM victims can try even harder than normal ALLOC_HARDER
-		 * users
+		 * users on the grounds that it's definitely going to be in
+		 * the exit path shortly and free memory. Any allocation it
+		 * makes during the free path will be small and short-lived.
 		 */
 		if (alloc_flags & ALLOC_OOM)
 			min -= min / 2;
@@ -3651,7 +3653,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return __gfp_pfmemalloc_flags(gfp_mask) > 0;
+	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
 /*
@@ -3804,7 +3806,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
-	int reserves;
+	int reserve_flags;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3910,16 +3912,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	reserves = __gfp_pfmemalloc_flags(gfp_mask);
-	if (reserves)
-		alloc_flags = reserves;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || reserves) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);

-- 
Michal Hocko
SUSE Labs
