All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: zram OOM behavior
@ 2012-11-02  6:39 Minchan Kim
  2012-11-02  8:30 ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2012-11-02  6:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

Hi Mel,

On Thu, Nov 01, 2012 at 08:28:14AM +0000, Mel Gorman wrote:
> On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> > On Thu, 1 Nov 2012, Minchan Kim wrote:
> > 
> > > It's not true any more.
> > > 3.6 includes following code in try_to_free_pages
> > > 
> > >         /*   
> > >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > >          * that the page allocator does not consider triggering OOM
> > >          */
> > >         if (fatal_signal_pending(current))
> > >                 return 1;
> > > 
> > > So the hunged task never go to the OOM path and could be looping forever.
> > > 
> > 
> > Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> > storage").  Thanks for adding Mel to the cc.
> > 
> 
> Indeed, thanks.
> 
> > The oom killer specifically has logic for this condition: when calling 
> > out_of_memory() the first thing it does is
> > 
> > 	if (fatal_signal_pending(current))
> > 		set_thread_flag(TIF_MEMDIE);
> > 
> > to allow it access to memory reserves so that it may exit if it's having 
> > trouble.  But that ends up never happening because of the above code that 
> > Minchan has identified.
> > 
> > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> > as well or revert that early return entirely; there's no justification 
> > given for it in the comment nor in the commit log. 
> 
> The check for fatal signal is in the wrong place. The reason it was added
> is because a throttled process sleeps in an interruptible sleep.  If a user
> user forcibly kills a throttled process, it should not result in an OOM kill.
> 
> > I'd rather remove it 
> > and allow the oom killer to trigger and grant access to memory reserves 
> > itself if necessary.
> > 
> > Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> > they need memory in the exit path since the oom killer never gets called?
> > 
> 
> It doesn't. How about this?
> 
> ---8<---
> mm: vmscan: Check for fatal signals iff the process was throttled
> 
> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in am OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
> 
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs.

I'm not sure how below patch achieve your goal which is to prevent
unnecessary OOM kill if throttled process is killed by user during
throttling. If I misunderstood your goal, please correct me and
write down it in description for making it more clear.

If user kills throttled process, throttle_direct_reclaim returns true by
this patch so try_to_free_pages returns 1. It means it doesn't call OOM
in first path of reclaim but shortly it will try to reclaim again
by should_alloc_retry. And since this second path, throttle_direct_reclaim
will continue to return false so that it could end up calling OOM kill.

Is it a your intention? If so, what's different with old version?
This patch just delay OOM kill so what's benefit does it has?


> 
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b7edfa..ca9e37f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this
> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
>  	struct zone *zone;
> @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> -		return;
> +		goto out;
> +
> +	/*
> +	 * If a fatal signal is pending, this process should not throttle.
> +	 * It should return quickly so it can exit and free its memory
> +	 */
> +	if (fatal_signal_pending(current))
> +		goto out;
>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>  	pgdat = zone->zone_pgdat;
>  	if (pfmemalloc_watermark_ok(pgdat))
> -		return;
> +		goto out;
>  
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>  			pfmemalloc_watermark_ok(pgdat), HZ);
> -		return;
> +
> +		goto check_pending;
>  	}
>  
>  	/* Throttle until kswapd wakes the process */
>  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>  		pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +	if (fatal_signal_pending(current))
> +		return true;
> +
> +out:
> +	return false;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>  	/*
> -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -	 * that the page allocator does not consider triggering OOM
> +	 * Do not enter reclaim if fatal signal was delivered while throttled.
> +	 * 1 is returned so that the page allocator does not OOM kill at this
> +	 * point.
>  	 */
> -	if (fatal_signal_pending(current))
> +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>  		return 1;
>  
>  	trace_mm_vmscan_direct_reclaim_begin(order,
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-02  6:39 zram OOM behavior Minchan Kim
@ 2012-11-02  8:30 ` Mel Gorman
  2012-11-02 22:36   ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2012-11-02  8:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Fri, Nov 02, 2012 at 03:39:58PM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> On Thu, Nov 01, 2012 at 08:28:14AM +0000, Mel Gorman wrote:
> > On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> > > On Thu, 1 Nov 2012, Minchan Kim wrote:
> > > 
> > > > It's not true any more.
> > > > 3.6 includes following code in try_to_free_pages
> > > > 
> > > >         /*   
> > > >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > >          * that the page allocator does not consider triggering OOM
> > > >          */
> > > >         if (fatal_signal_pending(current))
> > > >                 return 1;
> > > > 
> > > > So the hunged task never go to the OOM path and could be looping forever.
> > > > 
> > > 
> > > Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> > > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> > > storage").  Thanks for adding Mel to the cc.
> > > 
> > 
> > Indeed, thanks.
> > 
> > > The oom killer specifically has logic for this condition: when calling 
> > > out_of_memory() the first thing it does is
> > > 
> > > 	if (fatal_signal_pending(current))
> > > 		set_thread_flag(TIF_MEMDIE);
> > > 
> > > to allow it access to memory reserves so that it may exit if it's having 
> > > trouble.  But that ends up never happening because of the above code that 
> > > Minchan has identified.
> > > 
> > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> > > as well or revert that early return entirely; there's no justification 
> > > given for it in the comment nor in the commit log. 
> > 
> > The check for fatal signal is in the wrong place. The reason it was added
> > is because a throttled process sleeps in an interruptible sleep.  If a user
> > user forcibly kills a throttled process, it should not result in an OOM kill.
> > 
> > > I'd rather remove it 
> > > and allow the oom killer to trigger and grant access to memory reserves 
> > > itself if necessary.
> > > 
> > > Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> > > they need memory in the exit path since the oom killer never gets called?
> > > 
> > 
> > It doesn't. How about this?
> > 
> > ---8<---
> > mm: vmscan: Check for fatal signals iff the process was throttled
> > 
> > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> > are low and swap is backed by network storage") introduced a check for
> > fatal signals after a process gets throttled for network storage. The
> > intention was that if a process was throttled and got killed that it
> > should not trigger the OOM killer. As pointed out by Minchan Kim and
> > David Rientjes, this check is in the wrong place and too broad. If a
> > system is in am OOM situation and a process is exiting, it can loop in
> > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> > fatal signal is pending it returns 1 as if it is making forward progress
> > and can effectively deadlock.
> > 
> > This patch moves the fatal_signal_pending() check after throttling to
> > throttle_direct_reclaim() where it belongs.
> 
> I'm not sure how below patch achieve your goal which is to prevent
> unnecessary OOM kill if throttled process is killed by user during
> throttling. If I misunderstood your goal, please correct me and
> write down it in description for making it more clear.
> 
> If user kills throttled process, throttle_direct_reclaim returns true by
> this patch so try_to_free_pages returns 1. It means it doesn't call OOM
> in first path of reclaim but shortly it will try to reclaim again
> by should_alloc_retry.

Yes and it returned without calling direct reclaim.

> And since this second path, throttle_direct_reclaim
> will continue to return false so that it could end up calling OOM kill.
> 

Yes except the second time it has not been throttled and it entered direct
reclaim. If it fails to make any progress it will return 0 but if this
happens, it potentially really is an OOM situation. If it manages to
reclaim, it'll be returning a positive number, is making forward
progress and should successfully exit without triggering OOM.

Note that throttle_direct_reclaim also now checks fatal_signal_pending
before deciding to throttle at all.

> Is it a your intention? If so, what's different with old version?
> This patch just delay OOM kill so what's benefit does it has?
> 

In the first version it would never try to enter direct reclaim if a
fatal signal was pending but always claim that forward progress was
being made.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-02  8:30 ` Mel Gorman
@ 2012-11-02 22:36   ` Minchan Kim
  2012-11-05 14:46     ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2012-11-02 22:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Fri, Nov 02, 2012 at 08:30:57AM +0000, Mel Gorman wrote:
> On Fri, Nov 02, 2012 at 03:39:58PM +0900, Minchan Kim wrote:
> > Hi Mel,
> > 
> > On Thu, Nov 01, 2012 at 08:28:14AM +0000, Mel Gorman wrote:
> > > On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> > > > On Thu, 1 Nov 2012, Minchan Kim wrote:
> > > > 
> > > > > It's not true any more.
> > > > > 3.6 includes following code in try_to_free_pages
> > > > > 
> > > > >         /*   
> > > > >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > >          * that the page allocator does not consider triggering OOM
> > > > >          */
> > > > >         if (fatal_signal_pending(current))
> > > > >                 return 1;
> > > > > 
> > > > > So the hunged task never go to the OOM path and could be looping forever.
> > > > > 
> > > > 
> > > > Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> > > > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> > > > storage").  Thanks for adding Mel to the cc.
> > > > 
> > > 
> > > Indeed, thanks.
> > > 
> > > > The oom killer specifically has logic for this condition: when calling 
> > > > out_of_memory() the first thing it does is
> > > > 
> > > > 	if (fatal_signal_pending(current))
> > > > 		set_thread_flag(TIF_MEMDIE);
> > > > 
> > > > to allow it access to memory reserves so that it may exit if it's having 
> > > > trouble.  But that ends up never happening because of the above code that 
> > > > Minchan has identified.
> > > > 
> > > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> > > > as well or revert that early return entirely; there's no justification 
> > > > given for it in the comment nor in the commit log. 
> > > 
> > > The check for fatal signal is in the wrong place. The reason it was added
> > > is because a throttled process sleeps in an interruptible sleep.  If a user
> > > user forcibly kills a throttled process, it should not result in an OOM kill.
> > > 
> > > > I'd rather remove it 
> > > > and allow the oom killer to trigger and grant access to memory reserves 
> > > > itself if necessary.
> > > > 
> > > > Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> > > > they need memory in the exit path since the oom killer never gets called?
> > > > 
> > > 
> > > It doesn't. How about this?
> > > 
> > > ---8<---
> > > mm: vmscan: Check for fatal signals iff the process was throttled
> > > 
> > > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> > > are low and swap is backed by network storage") introduced a check for
> > > fatal signals after a process gets throttled for network storage. The
> > > intention was that if a process was throttled and got killed that it
> > > should not trigger the OOM killer. As pointed out by Minchan Kim and
> > > David Rientjes, this check is in the wrong place and too broad. If a
> > > system is in am OOM situation and a process is exiting, it can loop in
> > > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> > > fatal signal is pending it returns 1 as if it is making forward progress
> > > and can effectively deadlock.
> > > 
> > > This patch moves the fatal_signal_pending() check after throttling to
> > > throttle_direct_reclaim() where it belongs.
> > 
> > I'm not sure how below patch achieve your goal which is to prevent
> > unnecessary OOM kill if throttled process is killed by user during
> > throttling. If I misunderstood your goal, please correct me and
> > write down it in description for making it more clear.
> > 
> > If user kills throttled process, throttle_direct_reclaim returns true by
> > this patch so try_to_free_pages returns 1. It means it doesn't call OOM
> > in first path of reclaim but shortly it will try to reclaim again
> > by should_alloc_retry.
> 
> Yes and it returned without calling direct reclaim.
> 
> > And since this second path, throttle_direct_reclaim
> > will continue to return false so that it could end up calling OOM kill.
> > 
> 
> Yes except the second time it has not been throttled and it entered direct
> reclaim. If it fails to make any progress it will return 0 but if this
> happens, it potentially really is an OOM situation. If it manages to
> reclaim, it'll be returning a positive number, is making forward
> progress and should successfully exit without triggering OOM.
> 
> Note that throttle_direct_reclaim also now checks fatal_signal_pending
> before deciding to throttle at all.
> 
> > Is it a your intention? If so, what's different with old version?
> > This patch just delay OOM kill so what's benefit does it has?
> > 
> 
> In the first version it would never try to enter direct reclaim if a
> fatal signal was pending but always claim that forward progress was
> being made.

Surely we need fix for preventing deadlock with OOM kill and that's why
I have Cced you and this patch fixes it but my question is why we need 
such fatal signal checking trick.

How about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10090c8..881619e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
        throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
 
-       /*
-        * Do not enter reclaim if fatal signal is pending. 1 is returned so
-        * that the page allocator does not consider triggering OOM
-        */
-       if (fatal_signal_pending(current))
-               return 1;
-
        trace_mm_vmscan_direct_reclaim_begin(order,
                                sc.may_writepage,
                                gfp_mask);
 
In this case, after throttling, current will try to do direct reclaim and
if he makes forward progress, he will get a memory and exit if he receive KILL signal.
If he can't make forward progress with direct reclaim, he can ends up OOM path but
out_of_memory checks signal check of current and allow to access reserved memory pool
for quick exit and return without killing other victim selection.
Is it a problem for your case?

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-02 22:36   ` Minchan Kim
@ 2012-11-05 14:46     ` Mel Gorman
  2012-11-06  0:25       ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2012-11-05 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > <SNIP>
> > In the first version it would never try to enter direct reclaim if a
> > fatal signal was pending but always claim that forward progress was
> > being made.
> 
> Surely we need fix for preventing deadlock with OOM kill and that's why
> I have Cced you and this patch fixes it but my question is why we need 
> such fatal signal checking trick.
> 
> How about this?
> 

Both will work as expected but....

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 10090c8..881619e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  
>         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>  
> -       /*
> -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -        * that the page allocator does not consider triggering OOM
> -        */
> -       if (fatal_signal_pending(current))
> -               return 1;
> -
>         trace_mm_vmscan_direct_reclaim_begin(order,
>                                 sc.may_writepage,
>                                 gfp_mask);
>  
> In this case, after throttling, current will try to do direct reclaim and
> if he makes forward progress, he will get a memory and exit if he receive KILL signal.

It may be completely unnecessary to reclaim memory if the process that was
throttled and killed just exits quickly. As the fatal signal is pending
it will be able to use the pfmemalloc reserves.

> If he can't make forward progress with direct reclaim, he can ends up OOM path but
> out_of_memory checks signal check of current and allow to access reserved memory pool
> for quick exit and return without killing other victim selection.

While this is true, what advantage is there to having a killed process
potentially reclaiming memory it does not need to?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-05 14:46     ` Mel Gorman
@ 2012-11-06  0:25       ` Minchan Kim
  2012-11-06  8:58         ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2012-11-06  0:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > <SNIP>
> > > In the first version it would never try to enter direct reclaim if a
> > > fatal signal was pending but always claim that forward progress was
> > > being made.
> > 
> > Surely we need fix for preventing deadlock with OOM kill and that's why
> > I have Cced you and this patch fixes it but my question is why we need 
> > such fatal signal checking trick.
> > 
> > How about this?
> > 
> 
> Both will work as expected but....
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 10090c8..881619e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  
> >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> >  
> > -       /*
> > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > -        * that the page allocator does not consider triggering OOM
> > -        */
> > -       if (fatal_signal_pending(current))
> > -               return 1;
> > -
> >         trace_mm_vmscan_direct_reclaim_begin(order,
> >                                 sc.may_writepage,
> >                                 gfp_mask);
> >  
> > In this case, after throttling, current will try to do direct reclaim and
> > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> 
> It may be completely unnecessary to reclaim memory if the process that was
> throttled and killed just exits quickly. As the fatal signal is pending
> it will be able to use the pfmemalloc reserves.
> 
> > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > out_of_memory checks signal check of current and allow to access reserved memory pool
> > for quick exit and return without killing other victim selection.
> 
> While this is true, what advantage is there to having a killed process
> potentially reclaiming memory it does not need to?

Killed process needs a memory for him to be terminated. I think it's not a good idea for him
to use reserved memory pool unconditionally although he is throtlled and killed.
Because reserved memory pool is very stricted resource for emergency so using reserved memory
pool should be last resort after he fail to reclaim.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-06  0:25       ` Minchan Kim
@ 2012-11-06  8:58         ` Mel Gorman
  2012-11-06 10:17           ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2012-11-06  8:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > <SNIP>
> > > > In the first version it would never try to enter direct reclaim if a
> > > > fatal signal was pending but always claim that forward progress was
> > > > being made.
> > > 
> > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > I have Cced you and this patch fixes it but my question is why we need 
> > > such fatal signal checking trick.
> > > 
> > > How about this?
> > > 
> > 
> > Both will work as expected but....
> > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 10090c8..881619e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > >  
> > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > >  
> > > -       /*
> > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > -        * that the page allocator does not consider triggering OOM
> > > -        */
> > > -       if (fatal_signal_pending(current))
> > > -               return 1;
> > > -
> > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > >                                 sc.may_writepage,
> > >                                 gfp_mask);
> > >  
> > > In this case, after throttling, current will try to do direct reclaim and
> > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > 
> > It may be completely unnecessary to reclaim memory if the process that was
> > throttled and killed just exits quickly. As the fatal signal is pending
> > it will be able to use the pfmemalloc reserves.
> > 
> > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > for quick exit and return without killing other victim selection.
> > 
> > While this is true, what advantage is there to having a killed process
> > potentially reclaiming memory it does not need to?
> 
> Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> to use reserved memory pool unconditionally although he is throtlled and killed.
> Because reserved memory pool is very stricted resource for emergency so using reserved memory
> pool should be last resort after he fail to reclaim.
> 

Part of that reclaim can be the process reclaiming its own pages and
putting them in swap just so it can exit shortly afterwards. If it was
throttled in this path, it implies that swap-over-NFS is enabled where
such reclaim in fact might require the pfmemalloc reserves to be used to
allocate network buffers. It's potentially unnecessary work because the
same reserves could have been used to just exit the process.

I'll go your way if you insist because it's not like getting throttled
and killed before exit is a common situation and it should work either
way.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-06  8:58         ` Mel Gorman
@ 2012-11-06 10:17           ` Minchan Kim
  2012-11-09  9:50             ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2012-11-06 10:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Tue, Nov 06, 2012 at 08:58:22AM +0000, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> > On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > > <SNIP>
> > > > > In the first version it would never try to enter direct reclaim if a
> > > > > fatal signal was pending but always claim that forward progress was
> > > > > being made.
> > > > 
> > > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > > I have Cced you and this patch fixes it but my question is why we need 
> > > > such fatal signal checking trick.
> > > > 
> > > > How about this?
> > > > 
> > > 
> > > Both will work as expected but....
> > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 10090c8..881619e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > >  
> > > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > > >  
> > > > -       /*
> > > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > -        * that the page allocator does not consider triggering OOM
> > > > -        */
> > > > -       if (fatal_signal_pending(current))
> > > > -               return 1;
> > > > -
> > > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > > >                                 sc.may_writepage,
> > > >                                 gfp_mask);
> > > >  
> > > > In this case, after throttling, current will try to do direct reclaim and
> > > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > > 
> > > It may be completely unnecessary to reclaim memory if the process that was
> > > throttled and killed just exits quickly. As the fatal signal is pending
> > > it will be able to use the pfmemalloc reserves.
> > > 
> > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > for quick exit and return without killing other victim selection.
> > > 
> > > While this is true, what advantage is there to having a killed process
> > > potentially reclaiming memory it does not need to?
> > 
> > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > to use reserved memory pool unconditionally although he is throtlled and killed.
> > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > pool should be last resort after he fail to reclaim.
> > 
> 
> Part of that reclaim can be the process reclaiming its own pages and
> putting them in swap just so it can exit shortly afterwards. If it was
> throttled in this path, it implies that swap-over-NFS is enabled where

Could we make sure it's only the case for swap-over-NFS?
I think it can happen if the system has very slow thumb card.

> such reclaim in fact might require the pfmemalloc reserves to be used to
> allocate network buffers. It's potentially unnecessary work because the

You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
Yes. In this case, you're right. I would be better to use reserve pool for
just exiting instead of swap out over network. But how can you make sure that
we have only anonymous page when we try to reclaim? 
If there are some file-backed pages, we can avoid swapout at that time.
Maybe we need some check.

> same reserves could have been used to just exit the process.
> 
> I'll go your way if you insist because it's not like getting throttled
> and killed before exit is a common situation and it should work either
> way.

I don't want to insist on. Just want to know what's the problem and find
better solution. :) 

P.S) I'm at situation which is very hard to sit down in front of computer
for a long time due to really really thanksful training course. :(
Shortly, I should go to dance.
Please feel free to send patch without expectation I will send patch soon.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-06 10:17           ` Minchan Kim
@ 2012-11-09  9:50             ` Mel Gorman
  2012-11-12 13:32               ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2012-11-09  9:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Tue, Nov 06, 2012 at 07:17:20PM +0900, Minchan Kim wrote:
> On Tue, Nov 06, 2012 at 08:58:22AM +0000, Mel Gorman wrote:
> > On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> > > On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > > > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > > > <SNIP>
> > > > > > In the first version it would never try to enter direct reclaim if a
> > > > > > fatal signal was pending but always claim that forward progress was
> > > > > > being made.
> > > > > 
> > > > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > > > I have Cced you and this patch fixes it but my question is why we need 
> > > > > such fatal signal checking trick.
> > > > > 
> > > > > How about this?
> > > > > 
> > > > 
> > > > Both will work as expected but....
> > > > 
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 10090c8..881619e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > > >  
> > > > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > > > >  
> > > > > -       /*
> > > > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > > -        * that the page allocator does not consider triggering OOM
> > > > > -        */
> > > > > -       if (fatal_signal_pending(current))
> > > > > -               return 1;
> > > > > -
> > > > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > > > >                                 sc.may_writepage,
> > > > >                                 gfp_mask);
> > > > >  
> > > > > In this case, after throttling, current will try to do direct reclaim and
> > > > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > > > 
> > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > it will be able to use the pfmemalloc reserves.
> > > > 
> > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > for quick exit and return without killing other victim selection.
> > > > 
> > > > While this is true, what advantage is there to having a killed process
> > > > potentially reclaiming memory it does not need to?
> > > 
> > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > pool should be last resort after he fail to reclaim.
> > > 
> > 
> > Part of that reclaim can be the process reclaiming its own pages and
> > putting them in swap just so it can exit shortly afterwards. If it was
> > throttled in this path, it implies that swap-over-NFS is enabled where
> 
> Could we make sure it's only the case for swap-over-NFS?

The PFMEMALLOC reserves being consumed to the point of throttline is only
expected in the case of swap-over-network -- check the pgscan_direct_throttle
counter to be sure. So it's already the case that this throttling logic and
its signal handling is mostly a swap-over-NFS thing. It is possible that
a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
could force a situation where a process gets throttled but I'm not aware
of a case where this happens todays.

> I think it can happen if the system has very slow thumb card.
> 

How? They shouldn't be stuck in throttling in this case. They should be
blocked on IO, congestion wait, dirty throttling etc.

> > such reclaim in fact might require the pfmemalloc reserves to be used to
> > allocate network buffers. It's potentially unnecessary work because the
> 
> You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?

In very low-memory situations - yes. We can be at the min watermark but
still need to allocate a page for a network buffer to swap out the anon page.

> Yes. In this case, you're right. I would be better to use reserve pool for
> just exiting instead of swap out over network. But how can you make sure that
> we have only anonymous page when we try to reclaim? 
> If there are some file-backed pages, we can avoid swapout at that time.
> Maybe we need some check.
> 

That would be a fairly invasive set of checks for a corner case. if
swap-over-nfs + critically low + about to OOM + file pages available then
only reclaim files.

It's getting off track as to why we're having this discussion in the first
place -- looping due to improper handling of fatal signal pending.

> > same reserves could have been used to just exit the process.
> > 
> > I'll go your way if you insist because it's not like getting throttled
> > and killed before exit is a common situation and it should work either
> > way.
> 
> I don't want to insist on. Just want to know what's the problem and find
> better solution. :) 
> 

In that case, I'm going to send the patch to Andrew on Monday and avoid
direct reclaim when a fatal signal is pending in the swap-over-network
case. Are you ok with that?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-09  9:50             ` Mel Gorman
@ 2012-11-12 13:32               ` Minchan Kim
  2012-11-12 14:06                 ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2012-11-12 13:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

Sorry for the late reply.
I'm still going on training course until this week so my response would be delayed, too.

On Fri, Nov 09, 2012 at 09:50:24AM +0000, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 07:17:20PM +0900, Minchan Kim wrote:
> > On Tue, Nov 06, 2012 at 08:58:22AM +0000, Mel Gorman wrote:
> > > On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> > > > On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > > > > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > > > > <SNIP>
> > > > > > > In the first version it would never try to enter direct reclaim if a
> > > > > > > fatal signal was pending but always claim that forward progress was
> > > > > > > being made.
> > > > > > 
> > > > > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > > > > I have Cced you and this patch fixes it but my question is why we need 
> > > > > > such fatal signal checking trick.
> > > > > > 
> > > > > > How about this?
> > > > > > 
> > > > > 
> > > > > Both will work as expected but....
> > > > > 
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 10090c8..881619e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > > > >  
> > > > > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > > > > >  
> > > > > > -       /*
> > > > > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > > > -        * that the page allocator does not consider triggering OOM
> > > > > > -        */
> > > > > > -       if (fatal_signal_pending(current))
> > > > > > -               return 1;
> > > > > > -
> > > > > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > > > > >                                 sc.may_writepage,
> > > > > >                                 gfp_mask);
> > > > > >  
> > > > > > In this case, after throttling, current will try to do direct reclaim and
> > > > > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > > > > 
> > > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > > it will be able to use the pfmemalloc reserves.
> > > > > 
> > > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > > for quick exit and return without killing other victim selection.
> > > > > 
> > > > > While this is true, what advantage is there to having a killed process
> > > > > potentially reclaiming memory it does not need to?
> > > > 
> > > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > > pool should be last resort after he fail to reclaim.
> > > > 
> > > 
> > > Part of that reclaim can be the process reclaiming its own pages and
> > > putting them in swap just so it can exit shortly afterwards. If it was
> > > throttled in this path, it implies that swap-over-NFS is enabled where
> > 
> > Could we make sure it's only the case for swap-over-NFS?
> 
> The PFMEMALLOC reserves being consumed to the point of throttline is only
> expected in the case of swap-over-network -- check the pgscan_direct_throttle
> counter to be sure. So it's already the case that this throttling logic and
> its signal handling is mostly a swap-over-NFS thing. It is possible that
> a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
> could force a situation where a process gets throttled but I'm not aware
> of a case where this happens todays.

I saw some custom drviers in embedded side have used GFP_ATOMIC easily to protect
avoiding deadlock. Of course, it's not a good behavior but it lives with us.
Even, we can't fix it because we don't have any source. :(

> 
> > I think it can happen if the system has very slow thumb card.
> > 
> 
> How? They shouldn't be stuck in throttling in this case. They should be
> blocked on IO, congestion wait, dirty throttling etc.

Some block driver(ex, mmc) uses a thread model with PF_MEMALLOC so I think
they can be stucked by the throttling logic.

> 
> > > such reclaim in fact might require the pfmemalloc reserves to be used to
> > > allocate network buffers. It's potentially unnecessary work because the
> > 
> > You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
> 
> In very low-memory situations - yes. We can be at the min watermark but
> still need to allocate a page for a network buffer to swap out the anon page.
> 
> > Yes. In this case, you're right. I would be better to use reserve pool for
> > just exiting instead of swap out over network. But how can you make sure that
> > we have only anonymous page when we try to reclaim? 
> > If there are some file-backed pages, we can avoid swapout at that time.
> > Maybe we need some check.
> > 
> 
> That would be a fairly invasive set of checks for a corner case. if
> swap-over-nfs + critically low + about to OOM + file pages available then
> only reclaim files.
> 
> It's getting off track as to why we're having this discussion in the first
> place -- looping due to improper handling of fatal signal pending.

If some user tune /proc/sys/vm/swappiness, we could have many page cache pages
when swap-over-NFS happens.
My point is that why do we should use emergency memory pool although we have
reclaimalble memory?

> 
> > > same reserves could have been used to just exit the process.
> > > 
> > > I'll go your way if you insist because it's not like getting throttled
> > > and killed before exit is a common situation and it should work either
> > > way.
> > 
> > I don't want to insist on. Just want to know what's the problem and find
> > better solution. :) 
> > 
> 
> In that case, I'm going to send the patch to Andrew on Monday and avoid
> direct reclaim when a fatal signal is pending in the swap-over-network
> case. Are you ok with that?

Sorry but I don't think your patch is best approach.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-12 13:32               ` Minchan Kim
@ 2012-11-12 14:06                 ` Mel Gorman
  2012-11-13 13:31                   ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2012-11-12 14:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Mon, Nov 12, 2012 at 10:32:18PM +0900, Minchan Kim wrote:
> Sorry for the late reply.
> I'm still going on training course until this week so my response would be delayed, too.
> 
> > > > > > <SNIP>
> > > > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > > > it will be able to use the pfmemalloc reserves.
> > > > > > 
> > > > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > > > for quick exit and return without killing other victim selection.
> > > > > > 
> > > > > > While this is true, what advantage is there to having a killed process
> > > > > > potentially reclaiming memory it does not need to?
> > > > > 
> > > > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > > > pool should be last resort after he fail to reclaim.
> > > > > 
> > > > 
> > > > Part of that reclaim can be the process reclaiming its own pages and
> > > > putting them in swap just so it can exit shortly afterwards. If it was
> > > > throttled in this path, it implies that swap-over-NFS is enabled where
> > > 
> > > Could we make sure it's only the case for swap-over-NFS?
> > 
> > The PFMEMALLOC reserves being consumed to the point of throttline is only
> > expected in the case of swap-over-network -- check the pgscan_direct_throttle
> > counter to be sure. So it's already the case that this throttling logic and
> > its signal handling is mostly a swap-over-NFS thing. It is possible that
> > a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
> > could force a situation where a process gets throttled but I'm not aware
> > of a case where this happens todays.
> 
> I saw some custom drviers in embedded side have used GFP_ATOMIC easily to protect
> avoiding deadlock.

They must be getting a lot of allocation failures in that case.

> Of course, it's not a good behavior but it lives with us.
> Even, we can't fix it because we don't have any source. :(
> 
> > 
> > > I think it can happen if the system has very slow thumb card.
> > > 
> > 
> > How? They shouldn't be stuck in throttling in this case. They should be
> > blocked on IO, congestion wait, dirty throttling etc.
> 
> Some block driver(ex, mmc) uses a thread model with PF_MEMALLOC so I think
> they can be stucked by the throttling logic.
> 

If they are using PF_MEMALLOC + GFP_ATOMIC, there is a strong chance
that they'll actually deadlock their system if there are a storm of
allocations. The drivers is fundamentally broken in a dangerous way.
None of that is fixed by forcing an exiting process to enter direct reclaim.

> > 
> > > > such reclaim in fact might require the pfmemalloc reserves to be used to
> > > > allocate network buffers. It's potentially unnecessary work because the
> > > 
> > > You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
> > 
> > In very low-memory situations - yes. We can be at the min watermark but
> > still need to allocate a page for a network buffer to swap out the anon page.
> > 
> > > Yes. In this case, you're right. I would be better to use reserve pool for
> > > just exiting instead of swap out over network. But how can you make sure that
> > > we have only anonymous page when we try to reclaim? 
> > > If there are some file-backed pages, we can avoid swapout at that time.
> > > Maybe we need some check.
> > > 
> > 
> > That would be a fairly invasive set of checks for a corner case. if
> > swap-over-nfs + critically low + about to OOM + file pages available then
> > only reclaim files.
> > 
> > It's getting off track as to why we're having this discussion in the first
> > place -- looping due to improper handling of fatal signal pending.
> 
> If some user tune /proc/sys/vm/swappiness, we could have many page cache pages
> when swap-over-NFS happens.

That's a BIG if. swappiness could be anything and it'll depend on the
workload anyway.

> My point is that why do we should use emergency memory pool although we have
> reclaimalble memory?
> 

Because as I have already pointed out, the use of swap-over-nfs itself
creates more allocation pressure if it is used in the reclaim path. The
emergency memory pool is used *anyway* unless there are clean file pages
that can be discarded. But that's a big "if". The safer path is to try
and exit and if *that* fails *then* enter direct reclaim.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: zram OOM behavior
  2012-11-12 14:06                 ` Mel Gorman
@ 2012-11-13 13:31                   ` Minchan Kim
  2012-11-21 15:38                       ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2012-11-13 13:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Mon, Nov 12, 2012 at 02:06:31PM +0000, Mel Gorman wrote:
> On Mon, Nov 12, 2012 at 10:32:18PM +0900, Minchan Kim wrote:
> > Sorry for the late reply.
> > I'm still going on training course until this week so my response would be delayed, too.
> > 
> > > > > > > <SNIP>
> > > > > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > > > > it will be able to use the pfmemalloc reserves.
> > > > > > > 
> > > > > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > > > > for quick exit and return without killing other victim selection.
> > > > > > > 
> > > > > > > While this is true, what advantage is there to having a killed process
> > > > > > > potentially reclaiming memory it does not need to?
> > > > > > 
> > > > > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > > > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > > > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > > > > pool should be last resort after he fail to reclaim.
> > > > > > 
> > > > > 
> > > > > Part of that reclaim can be the process reclaiming its own pages and
> > > > > putting them in swap just so it can exit shortly afterwards. If it was
> > > > > throttled in this path, it implies that swap-over-NFS is enabled where
> > > > 
> > > > Could we make sure it's only the case for swap-over-NFS?
> > > 
> > > The PFMEMALLOC reserves being consumed to the point of throttline is only
> > > expected in the case of swap-over-network -- check the pgscan_direct_throttle
> > > counter to be sure. So it's already the case that this throttling logic and
> > > its signal handling is mostly a swap-over-NFS thing. It is possible that
> > > a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
> > > could force a situation where a process gets throttled but I'm not aware
> > > of a case where this happens todays.
> > 
> > I saw some custom drviers in embedded side have used GFP_ATOMIC easily to protect
> > avoiding deadlock.
> 
> They must be getting a lot of allocation failures in that case.

It depends on workload and I didn't received any report from them.

> 
> > Of course, it's not a good behavior but it lives with us.
> > Even, we can't fix it because we don't have any source. :(
> > 
> > > 
> > > > I think it can happen if the system has very slow thumb card.
> > > > 
> > > 
> > > How? They shouldn't be stuck in throttling in this case. They should be
> > > blocked on IO, congestion wait, dirty throttling etc.
> > 
> > Some block driver(ex, mmc) uses a thread model with PF_MEMALLOC so I think
> > they can be stucked by the throttling logic.
> > 
> 
> If they are using PF_MEMALLOC + GFP_ATOMIC, there is a strong chance
> that they'll actually deadlock their system if there are a storm of
> allocations. The drivers is fundamentally broken in a dangerous way.
> None of that is fixed by forcing an exiting process to enter direct reclaim.

Agreed.

> 
> > > 
> > > > > such reclaim in fact might require the pfmemalloc reserves to be used to
> > > > > allocate network buffers. It's potentially unnecessary work because the
> > > > 
> > > > You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
> > > 
> > > In very low-memory situations - yes. We can be at the min watermark but
> > > still need to allocate a page for a network buffer to swap out the anon page.
> > > 
> > > > Yes. In this case, you're right. I would be better to use reserve pool for
> > > > just exiting instead of swap out over network. But how can you make sure that
> > > > we have only anonymous page when we try to reclaim? 
> > > > If there are some file-backed pages, we can avoid swapout at that time.
> > > > Maybe we need some check.
> > > > 
> > > 
> > > That would be a fairly invasive set of checks for a corner case. if
> > > swap-over-nfs + critically low + about to OOM + file pages available then
> > > only reclaim files.
> > > 
> > > It's getting off track as to why we're having this discussion in the first
> > > place -- looping due to improper handling of fatal signal pending.
> > 
> > If some user tune /proc/sys/vm/swappiness, we could have many page cache pages
> > when swap-over-NFS happens.
> 
> That's a BIG if. swappiness could be anything and it'll depend on the
> workload anyway.

Yes but we don't have to ignore such case.

> 
> > My point is that why do we should use emergency memory pool although we have
> > reclaimalble memory?
> > 
> 
> Because as I have already pointed out, the use of swap-over-nfs itself
> creates more allocation pressure if it is used in the reclaim path. The
> emergency memory pool is used *anyway* unless there are clean file pages
> that can be discarded. But that's a big "if". The safer path is to try
> and exit and if *that* fails *then* enter direct reclaim.

Okay. Let's see your code again POV side effect other than OOM deadlock problem.

1. pfmemalloc_watermark_ok == false but the process is received SIGKILL
   before calling throttle_direct_reclaim.

In this case, it enters direct reclaim path and would swap out anon pages.
It's a thing you are concerning now(ie, creates more allocation pressure)
Is it okay?

2. pfmemalloc_watermark_ok == false but the process is received SIGKILL
   while throttling.

In this case, it skips direct reclaim in first path and retry to allocate page.
If another procces free some memory or is killed, it can get a free page and
return. Yes. it would be good rather than unnecessary swap out and OOM kill.
Otherwise, it calls direct compaction again and then enter direct reclaim path.
It ends up consuming emergency memory pool to swap out anonymous pages or
OOM killed. Again, it's a thing you are concerning now.

So, your patch's effect depends on timing that other process release memory.
Is it right?
If it is your intention, I don't oppose it any more because apprantely it
has a benefit than I suggested. But please write description more clearly.
Below previous description focused only OOM deadlock problem and didn't explain
patch's side effect which I mentioned above.

[
mm: vmscan: Check for fatal signals iff the process was throttled

commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
are low and swap is backed by network storage") introduced a check for
fatal signals after a process gets throttled for network storage. The
intention was that if a process was throttled and got killed that it
should not trigger the OOM killer. As pointed out by Minchan Kim and
David Rientjes, this check is in the wrong place and too broad. If a
system is in am OOM situation and a process is exiting, it can loop in
__alloc_pages_slowpath() and calling direct reclaim in a loop. As the
fatal signal is pending it returns 1 as if it is making forward progress
and can effectively deadlock.

This patch moves the fatal_signal_pending() check after throttling to
throttle_direct_reclaim() where it belongs.

If this patch passes review it should be considered a -stable candidate
for 3.6.
]

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
  2012-11-13 13:31                   ` Minchan Kim
@ 2012-11-21 15:38                       ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2012-11-21 15:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
are low and swap is backed by network storage") introduced a check for
fatal signals after a process gets throttled for network storage. The
intention was that if a process was throttled and got killed that it
should not trigger the OOM killer. As pointed out by Minchan Kim and
David Rientjes, this check is in the wrong place and too broad. If a
system is in am OOM situation and a process is exiting, it can loop in
__alloc_pages_slowpath() and calling direct reclaim in a loop. As the
fatal signal is pending it returns 1 as if it is making forward progress
and can effectively deadlock.

This patch moves the fatal_signal_pending() check after throttling to
throttle_direct_reclaim() where it belongs. If the process is killed
while throttled, it will return immediately without direct reclaim
except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
reserves.

Minchan pointed out that it may be better to direct reclaim before returning
to avoid using the reserves because there may be pages that can easily
reclaim that would avoid using the reserves. However, we do no such targetted
reclaim and there is no guarantee that suitable pages are available. As it
is expected that this throttling happens when swap-over-NFS is used there
is a possibility that the process will instead swap which may allocate
network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
case where a process can be throtted and be killed it can use the reserves
to exit or it can potentially use reserves to swap a few pages and then
exit. This patch takes the option of using the reserves if necessary to
allow the process exit quickly.

If this patch passes review it should be considered a -stable candidate
for 3.6.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..cbf84e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
  * Throttle direct reclaimers if backing storage is backed by the network
  * and the PFMEMALLOC reserve for the preferred node is getting dangerously
  * depleted. kswapd will continue to make progress and wake the processes
- * when the low watermark is reached
+ * when the low watermark is reached.
+ *
+ * Returns true if a fatal signal was delivered during throttling. If this
+ * happens, the page allocator should not consider triggering the OOM killer.
  */
-static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 					nodemask_t *nodemask)
 {
 	struct zone *zone;
@@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 * processes to block on log_wait_commit().
 	 */
 	if (current->flags & PF_KTHREAD)
-		return;
+		goto out;
+
+	/*
+	 * If a fatal signal is pending, this process should not throttle.
+	 * It should return quickly so it can exit and free its memory
+	 */
+	if (fatal_signal_pending(current))
+		goto out;
 
 	/* Check if the pfmemalloc reserves are ok */
 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
 	pgdat = zone->zone_pgdat;
 	if (pfmemalloc_watermark_ok(pgdat))
-		return;
+		goto out;
 
 	/* Account for the throttling */
 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
@@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
 			pfmemalloc_watermark_ok(pgdat), HZ);
-		return;
+
+		goto check_pending;
 	}
 
 	/* Throttle until kswapd wakes the process */
 	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
 		pfmemalloc_watermark_ok(pgdat));
+
+check_pending:
+	if (fatal_signal_pending(current))
+		return true;
+
+out:
+	return false;
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
@@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
-	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
-
 	/*
-	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
-	 * that the page allocator does not consider triggering OOM
+	 * Do not enter reclaim if fatal signal was delivered while throttled.
+	 * 1 is returned so that the page allocator does not OOM kill at this
+	 * point.
 	 */
-	if (fatal_signal_pending(current))
+	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
 		return 1;
 
 	trace_mm_vmscan_direct_reclaim_begin(order,

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
@ 2012-11-21 15:38                       ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2012-11-21 15:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
are low and swap is backed by network storage") introduced a check for
fatal signals after a process gets throttled for network storage. The
intention was that if a process was throttled and got killed that it
should not trigger the OOM killer. As pointed out by Minchan Kim and
David Rientjes, this check is in the wrong place and too broad. If a
system is in am OOM situation and a process is exiting, it can loop in
__alloc_pages_slowpath() and calling direct reclaim in a loop. As the
fatal signal is pending it returns 1 as if it is making forward progress
and can effectively deadlock.

This patch moves the fatal_signal_pending() check after throttling to
throttle_direct_reclaim() where it belongs. If the process is killed
while throttled, it will return immediately without direct reclaim
except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
reserves.

Minchan pointed out that it may be better to direct reclaim before returning
to avoid using the reserves because there may be pages that can easily
reclaim that would avoid using the reserves. However, we do no such targetted
reclaim and there is no guarantee that suitable pages are available. As it
is expected that this throttling happens when swap-over-NFS is used there
is a possibility that the process will instead swap which may allocate
network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
case where a process can be throtted and be killed it can use the reserves
to exit or it can potentially use reserves to swap a few pages and then
exit. This patch takes the option of using the reserves if necessary to
allow the process exit quickly.

If this patch passes review it should be considered a -stable candidate
for 3.6.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..cbf84e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
  * Throttle direct reclaimers if backing storage is backed by the network
  * and the PFMEMALLOC reserve for the preferred node is getting dangerously
  * depleted. kswapd will continue to make progress and wake the processes
- * when the low watermark is reached
+ * when the low watermark is reached.
+ *
+ * Returns true if a fatal signal was delivered during throttling. If this
+ * happens, the page allocator should not consider triggering the OOM killer.
  */
-static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 					nodemask_t *nodemask)
 {
 	struct zone *zone;
@@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 * processes to block on log_wait_commit().
 	 */
 	if (current->flags & PF_KTHREAD)
-		return;
+		goto out;
+
+	/*
+	 * If a fatal signal is pending, this process should not throttle.
+	 * It should return quickly so it can exit and free its memory
+	 */
+	if (fatal_signal_pending(current))
+		goto out;
 
 	/* Check if the pfmemalloc reserves are ok */
 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
 	pgdat = zone->zone_pgdat;
 	if (pfmemalloc_watermark_ok(pgdat))
-		return;
+		goto out;
 
 	/* Account for the throttling */
 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
@@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
 			pfmemalloc_watermark_ok(pgdat), HZ);
-		return;
+
+		goto check_pending;
 	}
 
 	/* Throttle until kswapd wakes the process */
 	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
 		pfmemalloc_watermark_ok(pgdat));
+
+check_pending:
+	if (fatal_signal_pending(current))
+		return true;
+
+out:
+	return false;
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
@@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
-	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
-
 	/*
-	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
-	 * that the page allocator does not consider triggering OOM
+	 * Do not enter reclaim if fatal signal was delivered while throttled.
+	 * 1 is returned so that the page allocator does not OOM kill at this
+	 * point.
 	 */
-	if (fatal_signal_pending(current))
+	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
 		return 1;
 
 	trace_mm_vmscan_direct_reclaim_begin(order,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
  2012-11-21 15:38                       ` Mel Gorman
@ 2012-11-21 20:15                         ` Andrew Morton
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2012-11-21 20:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

On Wed, 21 Nov 2012 15:38:24 +0000
Mel Gorman <mgorman@suse.de> wrote:

> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in am OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
> 
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs. If the process is killed
> while throttled, it will return immediately without direct reclaim
> except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
> reserves.
> 
> Minchan pointed out that it may be better to direct reclaim before returning
> to avoid using the reserves because there may be pages that can easily
> reclaim that would avoid using the reserves. However, we do no such targetted
> reclaim and there is no guarantee that suitable pages are available. As it
> is expected that this throttling happens when swap-over-NFS is used there
> is a possibility that the process will instead swap which may allocate
> network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
> case where a process can be throtted and be killed it can use the reserves
> to exit or it can potentially use reserves to swap a few pages and then
> exit. This patch takes the option of using the reserves if necessary to
> allow the process exit quickly.
> 
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
> 
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this

s/delivered/received/imo

> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
>  	struct zone *zone;
> @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> -		return;
> +		goto out;

hm, well, back in the old days some kernel threads were killable via
signals.  They had to opt-in to it by diddling their signal masks and a
few other things.  Too lazy to check if there are still any such sites.


> +	/*
> +	 * If a fatal signal is pending, this process should not throttle.
> +	 * It should return quickly so it can exit and free its memory
> +	 */
> +	if (fatal_signal_pending(current))
> +		goto out;

theresabug.  It should return "true" here.

>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>  	pgdat = zone->zone_pgdat;
>  	if (pfmemalloc_watermark_ok(pgdat))
> -		return;
> +		goto out;
>  
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>  			pfmemalloc_watermark_ok(pgdat), HZ);
> -		return;
> +
> +		goto check_pending;

And this can be just an "else".

>  	}
>  
>  	/* Throttle until kswapd wakes the process */
>  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>  		pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +	if (fatal_signal_pending(current))
> +		return true;
> +
> +out:
> +	return false;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>  	/*
> -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -	 * that the page allocator does not consider triggering OOM
> +	 * Do not enter reclaim if fatal signal was delivered while throttled.

Again, "received" is clearer.

> +	 * 1 is returned so that the page allocator does not OOM kill at this
> +	 * point.
>  	 */
> -	if (fatal_signal_pending(current))
> +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>  		return 1;
>  
>  	trace_mm_vmscan_direct_reclaim_begin(order,

So I end up with the below patch, which yields

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
					nodemask_t *nodemask)
{
	struct zone *zone;
	int high_zoneidx = gfp_zone(gfp_mask);
	pg_data_t *pgdat;

	/*
	 * Kernel threads should not be throttled as they may be indirectly
	 * responsible for cleaning pages necessary for reclaim to make forward
	 * progress. kjournald for example may enter direct reclaim while
	 * committing a transaction where throttling it could force other
	 * processes to block on log_wait_commit().
	 */
	if (current->flags & PF_KTHREAD)
		goto out;

	/*
	 * If a fatal signal is pending, this process should not throttle.
	 * It should return quickly so it can exit and free its memory
	 */
	if (fatal_signal_pending(current))
		goto killed;

	/* Check if the pfmemalloc reserves are ok */
	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
	pgdat = zone->zone_pgdat;
	if (pfmemalloc_watermark_ok(pgdat))
		goto out;

	/* Account for the throttling */
	count_vm_event(PGSCAN_DIRECT_THROTTLE);

	/*
	 * If the caller cannot enter the filesystem, it's possible that it
	 * is due to the caller holding an FS lock or performing a journal
	 * transaction in the case of a filesystem like ext[3|4]. In this case,
	 * it is not safe to block on pfmemalloc_wait as kswapd could be
	 * blocked waiting on the same lock. Instead, throttle for up to a
	 * second before continuing.
	 */
	if (!(gfp_mask & __GFP_FS)) {
		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
					pfmemalloc_watermark_ok(pgdat), HZ);
	} else {
		/* Throttle until kswapd wakes the process */
		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
				    pfmemalloc_watermark_ok(pgdat));
	}

	if (fatal_signal_pending(current)) {
killed:
		return true;
	}

out:
	return false;
}

(I hate that "goto killed" thing, but can't think of a better way)

--- a/mm/vmscan.c~mm-vmscan-check-for-fatal-signals-iff-the-process-was-throttled-fix
+++ a/mm/vmscan.c
@@ -2209,7 +2209,7 @@ static bool pfmemalloc_watermark_ok(pg_d
  * depleted. kswapd will continue to make progress and wake the processes
  * when the low watermark is reached.
  *
- * Returns true if a fatal signal was delivered during throttling. If this
+ * Returns true if a fatal signal was received during throttling.  If this
  * happens, the page allocator should not consider triggering the OOM killer.
  */
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
@@ -2223,7 +2223,7 @@ static bool throttle_direct_reclaim(gfp_
 	 * Kernel threads should not be throttled as they may be indirectly
 	 * responsible for cleaning pages necessary for reclaim to make forward
 	 * progress. kjournald for example may enter direct reclaim while
-	 * committing a transaction where throttling it could forcing other
+	 * committing a transaction where throttling it could force other
 	 * processes to block on log_wait_commit().
 	 */
 	if (current->flags & PF_KTHREAD)
@@ -2234,7 +2234,7 @@ static bool throttle_direct_reclaim(gfp_
 	 * It should return quickly so it can exit and free its memory
 	 */
 	if (fatal_signal_pending(current))
-		goto out;
+		goto killed;
 
 	/* Check if the pfmemalloc reserves are ok */
 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
@@ -2255,18 +2255,17 @@ static bool throttle_direct_reclaim(gfp_
 	 */
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-			pfmemalloc_watermark_ok(pgdat), HZ);
-
-		goto check_pending;
+					pfmemalloc_watermark_ok(pgdat), HZ);
+	} else {
+		/* Throttle until kswapd wakes the process */
+		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+				    pfmemalloc_watermark_ok(pgdat));
 	}
 
-	/* Throttle until kswapd wakes the process */
-	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-		pfmemalloc_watermark_ok(pgdat));
-
-check_pending:
-	if (fatal_signal_pending(current))
+	if (fatal_signal_pending(current)) {
+killed:
 		return true;
+	}
 
 out:
 	return false;
@@ -2292,7 +2291,7 @@ unsigned long try_to_free_pages(struct z
 	};
 
 	/*
-	 * Do not enter reclaim if fatal signal was delivered while throttled.
+	 * Do not enter reclaim if a fatal signal was received while throttled.
 	 * 1 is returned so that the page allocator does not OOM kill at this
 	 * point.
 	 */
_


(Still hating that "goto killed")

(relents)

How about this version?

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
					nodemask_t *nodemask)
{
	struct zone *zone;
	int high_zoneidx = gfp_zone(gfp_mask);
	pg_data_t *pgdat;

	/*
	 * Kernel threads should not be throttled as they may be indirectly
	 * responsible for cleaning pages necessary for reclaim to make forward
	 * progress. kjournald for example may enter direct reclaim while
	 * committing a transaction where throttling it could force other
	 * processes to block on log_wait_commit().
	 */
	if (current->flags & PF_KTHREAD)
		return false;

	/*
	 * If a fatal signal is pending, this process should not throttle.
	 * It should return quickly so it can exit and free its memory
	 */
	if (fatal_signal_pending(current))
		return true;

	/* Check if the pfmemalloc reserves are ok */
	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
	pgdat = zone->zone_pgdat;
	if (pfmemalloc_watermark_ok(pgdat))
		return false;

	/* Account for the throttling */
	count_vm_event(PGSCAN_DIRECT_THROTTLE);

	/*
	 * If the caller cannot enter the filesystem, it's possible that it
	 * is due to the caller holding an FS lock or performing a journal
	 * transaction in the case of a filesystem like ext[3|4]. In this case,
	 * it is not safe to block on pfmemalloc_wait as kswapd could be
	 * blocked waiting on the same lock. Instead, throttle for up to a
	 * second before continuing.
	 */
	if (!(gfp_mask & __GFP_FS)) {
		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
					pfmemalloc_watermark_ok(pgdat), HZ);
	} else {
		/* Throttle until kswapd wakes the process */
		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
				    pfmemalloc_watermark_ok(pgdat));
	}

	return fatal_signal_pending(current);
}


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
@ 2012-11-21 20:15                         ` Andrew Morton
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2012-11-21 20:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

On Wed, 21 Nov 2012 15:38:24 +0000
Mel Gorman <mgorman@suse.de> wrote:

> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in am OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
> 
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs. If the process is killed
> while throttled, it will return immediately without direct reclaim
> except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
> reserves.
> 
> Minchan pointed out that it may be better to direct reclaim before returning
> to avoid using the reserves because there may be pages that can easily
> reclaim that would avoid using the reserves. However, we do no such targetted
> reclaim and there is no guarantee that suitable pages are available. As it
> is expected that this throttling happens when swap-over-NFS is used there
> is a possibility that the process will instead swap which may allocate
> network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
> case where a process can be throtted and be killed it can use the reserves
> to exit or it can potentially use reserves to swap a few pages and then
> exit. This patch takes the option of using the reserves if necessary to
> allow the process exit quickly.
> 
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
> 
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this

s/delivered/received/imo

> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
>  	struct zone *zone;
> @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> -		return;
> +		goto out;

hm, well, back in the old days some kernel threads were killable via
signals.  They had to opt-in to it by diddling their signal masks and a
few other things.  Too lazy to check if there are still any such sites.


> +	/*
> +	 * If a fatal signal is pending, this process should not throttle.
> +	 * It should return quickly so it can exit and free its memory
> +	 */
> +	if (fatal_signal_pending(current))
> +		goto out;

theresabug.  It should return "true" here.

>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>  	pgdat = zone->zone_pgdat;
>  	if (pfmemalloc_watermark_ok(pgdat))
> -		return;
> +		goto out;
>  
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>  			pfmemalloc_watermark_ok(pgdat), HZ);
> -		return;
> +
> +		goto check_pending;

And this can be just an "else".

>  	}
>  
>  	/* Throttle until kswapd wakes the process */
>  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>  		pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +	if (fatal_signal_pending(current))
> +		return true;
> +
> +out:
> +	return false;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>  	/*
> -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -	 * that the page allocator does not consider triggering OOM
> +	 * Do not enter reclaim if fatal signal was delivered while throttled.

Again, "received" is clearer.

> +	 * 1 is returned so that the page allocator does not OOM kill at this
> +	 * point.
>  	 */
> -	if (fatal_signal_pending(current))
> +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>  		return 1;
>  
>  	trace_mm_vmscan_direct_reclaim_begin(order,

So I end up with the below patch, which yields

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
					nodemask_t *nodemask)
{
	struct zone *zone;
	int high_zoneidx = gfp_zone(gfp_mask);
	pg_data_t *pgdat;

	/*
	 * Kernel threads should not be throttled as they may be indirectly
	 * responsible for cleaning pages necessary for reclaim to make forward
	 * progress. kjournald for example may enter direct reclaim while
	 * committing a transaction where throttling it could force other
	 * processes to block on log_wait_commit().
	 */
	if (current->flags & PF_KTHREAD)
		goto out;

	/*
	 * If a fatal signal is pending, this process should not throttle.
	 * It should return quickly so it can exit and free its memory
	 */
	if (fatal_signal_pending(current))
		goto killed;

	/* Check if the pfmemalloc reserves are ok */
	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
	pgdat = zone->zone_pgdat;
	if (pfmemalloc_watermark_ok(pgdat))
		goto out;

	/* Account for the throttling */
	count_vm_event(PGSCAN_DIRECT_THROTTLE);

	/*
	 * If the caller cannot enter the filesystem, it's possible that it
	 * is due to the caller holding an FS lock or performing a journal
	 * transaction in the case of a filesystem like ext[3|4]. In this case,
	 * it is not safe to block on pfmemalloc_wait as kswapd could be
	 * blocked waiting on the same lock. Instead, throttle for up to a
	 * second before continuing.
	 */
	if (!(gfp_mask & __GFP_FS)) {
		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
					pfmemalloc_watermark_ok(pgdat), HZ);
	} else {
		/* Throttle until kswapd wakes the process */
		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
				    pfmemalloc_watermark_ok(pgdat));
	}

	if (fatal_signal_pending(current)) {
killed:
		return true;
	}

out:
	return false;
}

(I hate that "goto killed" thing, but can't think of a better way)

--- a/mm/vmscan.c~mm-vmscan-check-for-fatal-signals-iff-the-process-was-throttled-fix
+++ a/mm/vmscan.c
@@ -2209,7 +2209,7 @@ static bool pfmemalloc_watermark_ok(pg_d
  * depleted. kswapd will continue to make progress and wake the processes
  * when the low watermark is reached.
  *
- * Returns true if a fatal signal was delivered during throttling. If this
+ * Returns true if a fatal signal was received during throttling.  If this
  * happens, the page allocator should not consider triggering the OOM killer.
  */
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
@@ -2223,7 +2223,7 @@ static bool throttle_direct_reclaim(gfp_
 	 * Kernel threads should not be throttled as they may be indirectly
 	 * responsible for cleaning pages necessary for reclaim to make forward
 	 * progress. kjournald for example may enter direct reclaim while
-	 * committing a transaction where throttling it could forcing other
+	 * committing a transaction where throttling it could force other
 	 * processes to block on log_wait_commit().
 	 */
 	if (current->flags & PF_KTHREAD)
@@ -2234,7 +2234,7 @@ static bool throttle_direct_reclaim(gfp_
 	 * It should return quickly so it can exit and free its memory
 	 */
 	if (fatal_signal_pending(current))
-		goto out;
+		goto killed;
 
 	/* Check if the pfmemalloc reserves are ok */
 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
@@ -2255,18 +2255,17 @@ static bool throttle_direct_reclaim(gfp_
 	 */
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-			pfmemalloc_watermark_ok(pgdat), HZ);
-
-		goto check_pending;
+					pfmemalloc_watermark_ok(pgdat), HZ);
+	} else {
+		/* Throttle until kswapd wakes the process */
+		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+				    pfmemalloc_watermark_ok(pgdat));
 	}
 
-	/* Throttle until kswapd wakes the process */
-	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-		pfmemalloc_watermark_ok(pgdat));
-
-check_pending:
-	if (fatal_signal_pending(current))
+	if (fatal_signal_pending(current)) {
+killed:
 		return true;
+	}
 
 out:
 	return false;
@@ -2292,7 +2291,7 @@ unsigned long try_to_free_pages(struct z
 	};
 
 	/*
-	 * Do not enter reclaim if fatal signal was delivered while throttled.
+	 * Do not enter reclaim if a fatal signal was received while throttled.
 	 * 1 is returned so that the page allocator does not OOM kill at this
 	 * point.
 	 */
_


(Still hating that "goto killed")

(relents)

How about this version?

static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
					nodemask_t *nodemask)
{
	struct zone *zone;
	int high_zoneidx = gfp_zone(gfp_mask);
	pg_data_t *pgdat;

	/*
	 * Kernel threads should not be throttled as they may be indirectly
	 * responsible for cleaning pages necessary for reclaim to make forward
	 * progress. kjournald for example may enter direct reclaim while
	 * committing a transaction where throttling it could force other
	 * processes to block on log_wait_commit().
	 */
	if (current->flags & PF_KTHREAD)
		return false;

	/*
	 * If a fatal signal is pending, this process should not throttle.
	 * It should return quickly so it can exit and free its memory
	 */
	if (fatal_signal_pending(current))
		return true;

	/* Check if the pfmemalloc reserves are ok */
	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
	pgdat = zone->zone_pgdat;
	if (pfmemalloc_watermark_ok(pgdat))
		return false;

	/* Account for the throttling */
	count_vm_event(PGSCAN_DIRECT_THROTTLE);

	/*
	 * If the caller cannot enter the filesystem, it's possible that it
	 * is due to the caller holding an FS lock or performing a journal
	 * transaction in the case of a filesystem like ext[3|4]. In this case,
	 * it is not safe to block on pfmemalloc_wait as kswapd could be
	 * blocked waiting on the same lock. Instead, throttle for up to a
	 * second before continuing.
	 */
	if (!(gfp_mask & __GFP_FS)) {
		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
					pfmemalloc_watermark_ok(pgdat), HZ);
	} else {
		/* Throttle until kswapd wakes the process */
		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
				    pfmemalloc_watermark_ok(pgdat));
	}

	return fatal_signal_pending(current);
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
  2012-11-21 20:15                         ` Andrew Morton
@ 2012-11-21 21:05                           ` Mel Gorman
  -1 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2012-11-21 21:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

On Wed, Nov 21, 2012 at 12:15:59PM -0800, Andrew Morton wrote:
> On Wed, 21 Nov 2012 15:38:24 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> > are low and swap is backed by network storage") introduced a check for
> > fatal signals after a process gets throttled for network storage. The
> > intention was that if a process was throttled and got killed that it
> > should not trigger the OOM killer. As pointed out by Minchan Kim and
> > David Rientjes, this check is in the wrong place and too broad. If a
> > system is in am OOM situation and a process is exiting, it can loop in
> > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> > fatal signal is pending it returns 1 as if it is making forward progress
> > and can effectively deadlock.
> > 
> > This patch moves the fatal_signal_pending() check after throttling to
> > throttle_direct_reclaim() where it belongs. If the process is killed
> > while throttled, it will return immediately without direct reclaim
> > except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
> > reserves.
> > 
> > Minchan pointed out that it may be better to direct reclaim before returning
> > to avoid using the reserves because there may be pages that can easily
> > reclaim that would avoid using the reserves. However, we do no such targetted
> > reclaim and there is no guarantee that suitable pages are available. As it
> > is expected that this throttling happens when swap-over-NFS is used there
> > is a possibility that the process will instead swap which may allocate
> > network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
> > case where a process can be throtted and be killed it can use the reserves
> > to exit or it can potentially use reserves to swap a few pages and then
> > exit. This patch takes the option of using the reserves if necessary to
> > allow the process exit quickly.
> > 
> > If this patch passes review it should be considered a -stable candidate
> > for 3.6.
> > 
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >   * Throttle direct reclaimers if backing storage is backed by the network
> >   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> >   * depleted. kswapd will continue to make progress and wake the processes
> > - * when the low watermark is reached
> > + * when the low watermark is reached.
> > + *
> > + * Returns true if a fatal signal was delivered during throttling. If this
> 
> s/delivered/received/imo
> 

Ok.

> > + * happens, the page allocator should not consider triggering the OOM killer.
> >   */
> > -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  					nodemask_t *nodemask)
> >  {
> >  	struct zone *zone;
> > @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  	 * processes to block on log_wait_commit().
> >  	 */
> >  	if (current->flags & PF_KTHREAD)
> > -		return;
> > +		goto out;
> 
> hm, well, back in the old days some kernel threads were killable via
> signals.  They had to opt-in to it by diddling their signal masks and a
> few other things.  Too lazy to check if there are still any such sites.
> 

That check is against throttling rather than signal handling though. It
could have been just left as "return".

> 
> > +	/*
> > +	 * If a fatal signal is pending, this process should not throttle.
> > +	 * It should return quickly so it can exit and free its memory
> > +	 */
> > +	if (fatal_signal_pending(current))
> > +		goto out;
> 
> theresabug.  It should return "true" here.
> 

The intention here is that a process would

1. allocate, fail, enter direct reclaim
2. no signal pending, gets throttled because of low pfmemalloc reserves
3. a user kills -9 the throttled process. returns true and goes back
   to the page allocator
4. If that allocation fails again, it re-enters direct reclaim and tries
   to throttle. This time the fatal signal is pending but we know
   we must have already failed to make the allocation so this time false
   is rurned by throttle_direct_reclaim and it tries direct reclaim.
5. direct reclaim frees something -- probably clean file-backed pages
   if the last allocation attempt had failed.

so the fatal signal check should only prevent entering direct reclaim
once. Maybe the comment sucks

/*
 * If a fatal signal is pending, this process should not throttle.
 * It should return quickly so it can exit and free its memory. Note
 * that returning false here allows a process to enter direct reclaim.
 * Otherwise there is a risk that the process loops in the page
 * allocator, checking signals and never making forward progress
 */

?

> >  
> >  	/* Check if the pfmemalloc reserves are ok */
> >  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> >  	pgdat = zone->zone_pgdat;
> >  	if (pfmemalloc_watermark_ok(pgdat))
> > -		return;
> > +		goto out;
> >  
> >  	/* Account for the throttling */
> >  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> > @@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  	if (!(gfp_mask & __GFP_FS)) {
> >  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> >  			pfmemalloc_watermark_ok(pgdat), HZ);
> > -		return;
> > +
> > +		goto check_pending;
> 
> And this can be just an "else".
> 

ok.

> >  	}
> >  
> >  	/* Throttle until kswapd wakes the process */
> >  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> >  		pfmemalloc_watermark_ok(pgdat));
> > +
> > +check_pending:
> > +	if (fatal_signal_pending(current))
> > +		return true;
> > +
> > +out:
> > +	return false;
> >  }
> >  
> >  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > @@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  		.gfp_mask = sc.gfp_mask,
> >  	};
> >  
> > -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > -
> >  	/*
> > -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > -	 * that the page allocator does not consider triggering OOM
> > +	 * Do not enter reclaim if fatal signal was delivered while throttled.
> 
> Again, "received" is clearer.
> 

It is.

> > +	 * 1 is returned so that the page allocator does not OOM kill at this
> > +	 * point.
> >  	 */
> > -	if (fatal_signal_pending(current))
> > +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
> >  		return 1;
> >  
> >  	trace_mm_vmscan_direct_reclaim_begin(order,
> 
> So I end up with the below patch, which yields
> 
> static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> 					nodemask_t *nodemask)
> {
> 	struct zone *zone;
> 	int high_zoneidx = gfp_zone(gfp_mask);
> 	pg_data_t *pgdat;
> 
> 	/*
> 	 * Kernel threads should not be throttled as they may be indirectly
> 	 * responsible for cleaning pages necessary for reclaim to make forward
> 	 * progress. kjournald for example may enter direct reclaim while
> 	 * committing a transaction where throttling it could force other
> 	 * processes to block on log_wait_commit().
> 	 */
> 	if (current->flags & PF_KTHREAD)
> 		goto out;
> 
> 	/*
> 	 * If a fatal signal is pending, this process should not throttle.
> 	 * It should return quickly so it can exit and free its memory
> 	 */
> 	if (fatal_signal_pending(current))
> 		goto killed;
> 

I think there is a risk that this allows the page allocator to loop
in direct reclaim, always returning true here and never actually making
forward progress. Did I miss something?

> 	/* Check if the pfmemalloc reserves are ok */
> 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> 	pgdat = zone->zone_pgdat;
> 	if (pfmemalloc_watermark_ok(pgdat))
> 		goto out;
> 
> 	/* Account for the throttling */
> 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> 
> 	/*
> 	 * If the caller cannot enter the filesystem, it's possible that it
> 	 * is due to the caller holding an FS lock or performing a journal
> 	 * transaction in the case of a filesystem like ext[3|4]. In this case,
> 	 * it is not safe to block on pfmemalloc_wait as kswapd could be
> 	 * blocked waiting on the same lock. Instead, throttle for up to a
> 	 * second before continuing.
> 	 */
> 	if (!(gfp_mask & __GFP_FS)) {
> 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> 					pfmemalloc_watermark_ok(pgdat), HZ);
> 	} else {
> 		/* Throttle until kswapd wakes the process */
> 		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> 				    pfmemalloc_watermark_ok(pgdat));
> 	}
> 
> 	if (fatal_signal_pending(current)) {
> killed:
> 		return true;
> 	}
> 
> out:
> 	return false;
> }
> 
> (I hate that "goto killed" thing, but can't think of a better way)
> 
> --- a/mm/vmscan.c~mm-vmscan-check-for-fatal-signals-iff-the-process-was-throttled-fix
> +++ a/mm/vmscan.c
> @@ -2209,7 +2209,7 @@ static bool pfmemalloc_watermark_ok(pg_d
>   * depleted. kswapd will continue to make progress and wake the processes
>   * when the low watermark is reached.
>   *
> - * Returns true if a fatal signal was delivered during throttling. If this
> + * Returns true if a fatal signal was received during throttling.  If this
>   * happens, the page allocator should not consider triggering the OOM killer.
>   */
>  static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> @@ -2223,7 +2223,7 @@ static bool throttle_direct_reclaim(gfp_
>  	 * Kernel threads should not be throttled as they may be indirectly
>  	 * responsible for cleaning pages necessary for reclaim to make forward
>  	 * progress. kjournald for example may enter direct reclaim while
> -	 * committing a transaction where throttling it could forcing other
> +	 * committing a transaction where throttling it could force other
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> @@ -2234,7 +2234,7 @@ static bool throttle_direct_reclaim(gfp_
>  	 * It should return quickly so it can exit and free its memory
>  	 */
>  	if (fatal_signal_pending(current))
> -		goto out;
> +		goto killed;
>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> @@ -2255,18 +2255,17 @@ static bool throttle_direct_reclaim(gfp_
>  	 */
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> -			pfmemalloc_watermark_ok(pgdat), HZ);
> -
> -		goto check_pending;
> +					pfmemalloc_watermark_ok(pgdat), HZ);
> +	} else {
> +		/* Throttle until kswapd wakes the process */
> +		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> +				    pfmemalloc_watermark_ok(pgdat));
>  	}
>  
> -	/* Throttle until kswapd wakes the process */
> -	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> -		pfmemalloc_watermark_ok(pgdat));
> -
> -check_pending:
> -	if (fatal_signal_pending(current))
> +	if (fatal_signal_pending(current)) {
> +killed:
>  		return true;
> +	}
>  
>  out:
>  	return false;
> @@ -2292,7 +2291,7 @@ unsigned long try_to_free_pages(struct z
>  	};
>  
>  	/*
> -	 * Do not enter reclaim if fatal signal was delivered while throttled.
> +	 * Do not enter reclaim if a fatal signal was received while throttled.
>  	 * 1 is returned so that the page allocator does not OOM kill at this
>  	 * point.
>  	 */
> _
> 
> 
> (Still hating that "goto killed")
> 
> (relents)
> 
> How about this version?
> 
> static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> 					nodemask_t *nodemask)
> {
> 	struct zone *zone;
> 	int high_zoneidx = gfp_zone(gfp_mask);
> 	pg_data_t *pgdat;
> 
> 	/*
> 	 * Kernel threads should not be throttled as they may be indirectly
> 	 * responsible for cleaning pages necessary for reclaim to make forward
> 	 * progress. kjournald for example may enter direct reclaim while
> 	 * committing a transaction where throttling it could force other
> 	 * processes to block on log_wait_commit().
> 	 */
> 	if (current->flags & PF_KTHREAD)
> 		return false;
> 
> 	/*
> 	 * If a fatal signal is pending, this process should not throttle.
> 	 * It should return quickly so it can exit and free its memory
> 	 */
> 	if (fatal_signal_pending(current))
> 		return true;
> 

Same comment about the potential looping. Otherwise I think it's ok.

> 	/* Check if the pfmemalloc reserves are ok */
> 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> 	pgdat = zone->zone_pgdat;
> 	if (pfmemalloc_watermark_ok(pgdat))
> 		return false;
> 
> 	/* Account for the throttling */
> 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> 
> 	/*
> 	 * If the caller cannot enter the filesystem, it's possible that it
> 	 * is due to the caller holding an FS lock or performing a journal
> 	 * transaction in the case of a filesystem like ext[3|4]. In this case,
> 	 * it is not safe to block on pfmemalloc_wait as kswapd could be
> 	 * blocked waiting on the same lock. Instead, throttle for up to a
> 	 * second before continuing.
> 	 */
> 	if (!(gfp_mask & __GFP_FS)) {
> 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> 					pfmemalloc_watermark_ok(pgdat), HZ);
> 	} else {
> 		/* Throttle until kswapd wakes the process */
> 		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> 				    pfmemalloc_watermark_ok(pgdat));
> 	}
> 
> 	return fatal_signal_pending(current);
> }
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
@ 2012-11-21 21:05                           ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2012-11-21 21:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

On Wed, Nov 21, 2012 at 12:15:59PM -0800, Andrew Morton wrote:
> On Wed, 21 Nov 2012 15:38:24 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> > are low and swap is backed by network storage") introduced a check for
> > fatal signals after a process gets throttled for network storage. The
> > intention was that if a process was throttled and got killed that it
> > should not trigger the OOM killer. As pointed out by Minchan Kim and
> > David Rientjes, this check is in the wrong place and too broad. If a
> > system is in am OOM situation and a process is exiting, it can loop in
> > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> > fatal signal is pending it returns 1 as if it is making forward progress
> > and can effectively deadlock.
> > 
> > This patch moves the fatal_signal_pending() check after throttling to
> > throttle_direct_reclaim() where it belongs. If the process is killed
> > while throttled, it will return immediately without direct reclaim
> > except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
> > reserves.
> > 
> > Minchan pointed out that it may be better to direct reclaim before returning
> > to avoid using the reserves because there may be pages that can easily
> > reclaim that would avoid using the reserves. However, we do no such targetted
> > reclaim and there is no guarantee that suitable pages are available. As it
> > is expected that this throttling happens when swap-over-NFS is used there
> > is a possibility that the process will instead swap which may allocate
> > network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
> > case where a process can be throtted and be killed it can use the reserves
> > to exit or it can potentially use reserves to swap a few pages and then
> > exit. This patch takes the option of using the reserves if necessary to
> > allow the process exit quickly.
> > 
> > If this patch passes review it should be considered a -stable candidate
> > for 3.6.
> > 
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >   * Throttle direct reclaimers if backing storage is backed by the network
> >   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> >   * depleted. kswapd will continue to make progress and wake the processes
> > - * when the low watermark is reached
> > + * when the low watermark is reached.
> > + *
> > + * Returns true if a fatal signal was delivered during throttling. If this
> 
> s/delivered/received/imo
> 

Ok.

> > + * happens, the page allocator should not consider triggering the OOM killer.
> >   */
> > -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  					nodemask_t *nodemask)
> >  {
> >  	struct zone *zone;
> > @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  	 * processes to block on log_wait_commit().
> >  	 */
> >  	if (current->flags & PF_KTHREAD)
> > -		return;
> > +		goto out;
> 
> hm, well, back in the old days some kernel threads were killable via
> signals.  They had to opt-in to it by diddling their signal masks and a
> few other things.  Too lazy to check if there are still any such sites.
> 

That check is against throttling rather than signal handling though. It
could have been just left as "return".

> 
> > +	/*
> > +	 * If a fatal signal is pending, this process should not throttle.
> > +	 * It should return quickly so it can exit and free its memory
> > +	 */
> > +	if (fatal_signal_pending(current))
> > +		goto out;
> 
> theresabug.  It should return "true" here.
> 

The intention here is that a process would

1. allocate, fail, enter direct reclaim
2. no signal pending, gets throttled because of low pfmemalloc reserves
3. a user kills -9 the throttled process. returns true and goes back
   to the page allocator
4. If that allocation fails again, it re-enters direct reclaim and tries
   to throttle. This time the fatal signal is pending but we know
   we must have already failed to make the allocation so this time false
   is rurned by throttle_direct_reclaim and it tries direct reclaim.
5. direct reclaim frees something -- probably clean file-backed pages
   if the last allocation attempt had failed.

so the fatal signal check should only prevent entering direct reclaim
once. Maybe the comment sucks

/*
 * If a fatal signal is pending, this process should not throttle.
 * It should return quickly so it can exit and free its memory. Note
 * that returning false here allows a process to enter direct reclaim.
 * Otherwise there is a risk that the process loops in the page
 * allocator, checking signals and never making forward progress
 */

?

> >  
> >  	/* Check if the pfmemalloc reserves are ok */
> >  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> >  	pgdat = zone->zone_pgdat;
> >  	if (pfmemalloc_watermark_ok(pgdat))
> > -		return;
> > +		goto out;
> >  
> >  	/* Account for the throttling */
> >  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> > @@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  	if (!(gfp_mask & __GFP_FS)) {
> >  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> >  			pfmemalloc_watermark_ok(pgdat), HZ);
> > -		return;
> > +
> > +		goto check_pending;
> 
> And this can be just an "else".
> 

ok.

> >  	}
> >  
> >  	/* Throttle until kswapd wakes the process */
> >  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> >  		pfmemalloc_watermark_ok(pgdat));
> > +
> > +check_pending:
> > +	if (fatal_signal_pending(current))
> > +		return true;
> > +
> > +out:
> > +	return false;
> >  }
> >  
> >  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > @@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  		.gfp_mask = sc.gfp_mask,
> >  	};
> >  
> > -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > -
> >  	/*
> > -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > -	 * that the page allocator does not consider triggering OOM
> > +	 * Do not enter reclaim if fatal signal was delivered while throttled.
> 
> Again, "received" is clearer.
> 

It is.

> > +	 * 1 is returned so that the page allocator does not OOM kill at this
> > +	 * point.
> >  	 */
> > -	if (fatal_signal_pending(current))
> > +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
> >  		return 1;
> >  
> >  	trace_mm_vmscan_direct_reclaim_begin(order,
> 
> So I end up with the below patch, which yields
> 
> static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> 					nodemask_t *nodemask)
> {
> 	struct zone *zone;
> 	int high_zoneidx = gfp_zone(gfp_mask);
> 	pg_data_t *pgdat;
> 
> 	/*
> 	 * Kernel threads should not be throttled as they may be indirectly
> 	 * responsible for cleaning pages necessary for reclaim to make forward
> 	 * progress. kjournald for example may enter direct reclaim while
> 	 * committing a transaction where throttling it could force other
> 	 * processes to block on log_wait_commit().
> 	 */
> 	if (current->flags & PF_KTHREAD)
> 		goto out;
> 
> 	/*
> 	 * If a fatal signal is pending, this process should not throttle.
> 	 * It should return quickly so it can exit and free its memory
> 	 */
> 	if (fatal_signal_pending(current))
> 		goto killed;
> 

I think there is a risk that this allows the page allocator to loop
in direct reclaim, always returning true here and never actually making
forward progress. Did I miss something?

> 	/* Check if the pfmemalloc reserves are ok */
> 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> 	pgdat = zone->zone_pgdat;
> 	if (pfmemalloc_watermark_ok(pgdat))
> 		goto out;
> 
> 	/* Account for the throttling */
> 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> 
> 	/*
> 	 * If the caller cannot enter the filesystem, it's possible that it
> 	 * is due to the caller holding an FS lock or performing a journal
> 	 * transaction in the case of a filesystem like ext[3|4]. In this case,
> 	 * it is not safe to block on pfmemalloc_wait as kswapd could be
> 	 * blocked waiting on the same lock. Instead, throttle for up to a
> 	 * second before continuing.
> 	 */
> 	if (!(gfp_mask & __GFP_FS)) {
> 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> 					pfmemalloc_watermark_ok(pgdat), HZ);
> 	} else {
> 		/* Throttle until kswapd wakes the process */
> 		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> 				    pfmemalloc_watermark_ok(pgdat));
> 	}
> 
> 	if (fatal_signal_pending(current)) {
> killed:
> 		return true;
> 	}
> 
> out:
> 	return false;
> }
> 
> (I hate that "goto killed" thing, but can't think of a better way)
> 
> --- a/mm/vmscan.c~mm-vmscan-check-for-fatal-signals-iff-the-process-was-throttled-fix
> +++ a/mm/vmscan.c
> @@ -2209,7 +2209,7 @@ static bool pfmemalloc_watermark_ok(pg_d
>   * depleted. kswapd will continue to make progress and wake the processes
>   * when the low watermark is reached.
>   *
> - * Returns true if a fatal signal was delivered during throttling. If this
> + * Returns true if a fatal signal was received during throttling.  If this
>   * happens, the page allocator should not consider triggering the OOM killer.
>   */
>  static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> @@ -2223,7 +2223,7 @@ static bool throttle_direct_reclaim(gfp_
>  	 * Kernel threads should not be throttled as they may be indirectly
>  	 * responsible for cleaning pages necessary for reclaim to make forward
>  	 * progress. kjournald for example may enter direct reclaim while
> -	 * committing a transaction where throttling it could forcing other
> +	 * committing a transaction where throttling it could force other
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> @@ -2234,7 +2234,7 @@ static bool throttle_direct_reclaim(gfp_
>  	 * It should return quickly so it can exit and free its memory
>  	 */
>  	if (fatal_signal_pending(current))
> -		goto out;
> +		goto killed;
>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> @@ -2255,18 +2255,17 @@ static bool throttle_direct_reclaim(gfp_
>  	 */
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> -			pfmemalloc_watermark_ok(pgdat), HZ);
> -
> -		goto check_pending;
> +					pfmemalloc_watermark_ok(pgdat), HZ);
> +	} else {
> +		/* Throttle until kswapd wakes the process */
> +		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> +				    pfmemalloc_watermark_ok(pgdat));
>  	}
>  
> -	/* Throttle until kswapd wakes the process */
> -	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> -		pfmemalloc_watermark_ok(pgdat));
> -
> -check_pending:
> -	if (fatal_signal_pending(current))
> +	if (fatal_signal_pending(current)) {
> +killed:
>  		return true;
> +	}
>  
>  out:
>  	return false;
> @@ -2292,7 +2291,7 @@ unsigned long try_to_free_pages(struct z
>  	};
>  
>  	/*
> -	 * Do not enter reclaim if fatal signal was delivered while throttled.
> +	 * Do not enter reclaim if a fatal signal was received while throttled.
>  	 * 1 is returned so that the page allocator does not OOM kill at this
>  	 * point.
>  	 */
> _
> 
> 
> (Still hating that "goto killed")
> 
> (relents)
> 
> How about this version?
> 
> static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> 					nodemask_t *nodemask)
> {
> 	struct zone *zone;
> 	int high_zoneidx = gfp_zone(gfp_mask);
> 	pg_data_t *pgdat;
> 
> 	/*
> 	 * Kernel threads should not be throttled as they may be indirectly
> 	 * responsible for cleaning pages necessary for reclaim to make forward
> 	 * progress. kjournald for example may enter direct reclaim while
> 	 * committing a transaction where throttling it could force other
> 	 * processes to block on log_wait_commit().
> 	 */
> 	if (current->flags & PF_KTHREAD)
> 		return false;
> 
> 	/*
> 	 * If a fatal signal is pending, this process should not throttle.
> 	 * It should return quickly so it can exit and free its memory
> 	 */
> 	if (fatal_signal_pending(current))
> 		return true;
> 

Same comment about the potential looping. Otherwise I think it's ok.

> 	/* Check if the pfmemalloc reserves are ok */
> 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> 	pgdat = zone->zone_pgdat;
> 	if (pfmemalloc_watermark_ok(pgdat))
> 		return false;
> 
> 	/* Account for the throttling */
> 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> 
> 	/*
> 	 * If the caller cannot enter the filesystem, it's possible that it
> 	 * is due to the caller holding an FS lock or performing a journal
> 	 * transaction in the case of a filesystem like ext[3|4]. In this case,
> 	 * it is not safe to block on pfmemalloc_wait as kswapd could be
> 	 * blocked waiting on the same lock. Instead, throttle for up to a
> 	 * second before continuing.
> 	 */
> 	if (!(gfp_mask & __GFP_FS)) {
> 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> 					pfmemalloc_watermark_ok(pgdat), HZ);
> 	} else {
> 		/* Throttle until kswapd wakes the process */
> 		wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> 				    pfmemalloc_watermark_ok(pgdat));
> 	}
> 
> 	return fatal_signal_pending(current);
> }
> 

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
  2012-11-21 21:05                           ` Mel Gorman
  (?)
@ 2012-11-21 21:30                           ` Andrew Morton
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2012-11-21 21:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, LKML, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao, Minchan Kim

On Wed, 21 Nov 2012 21:05:20 +0000
Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Nov 21, 2012 at 12:15:59PM -0800, Andrew Morton wrote:
>
> ...
>
> > > -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > > +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > >  					nodemask_t *nodemask)
> > >  {
> > >  	struct zone *zone;
> > > @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > >  	 * processes to block on log_wait_commit().
> > >  	 */
> > >  	if (current->flags & PF_KTHREAD)
> > > -		return;
> > > +		goto out;
> > 
> > hm, well, back in the old days some kernel threads were killable via
> > signals.  They had to opt-in to it by diddling their signal masks and a
> > few other things.  Too lazy to check if there are still any such sites.
> > 
> 
> That check is against throttling rather than signal handling though. It
> could have been just left as "return".

My point is that there might still exist kernel threads which are killable
via signals.  Those threads match your criteria here: don't throttle -
just let them run to exit().

If there are indeed missed opportunities here then they will be small
ones.  And those threads probably only have signal_pending(), not
fatal_signal_pending().  Don't worry about it ;)

> > 
> > > +	/*
> > > +	 * If a fatal signal is pending, this process should not throttle.
> > > +	 * It should return quickly so it can exit and free its memory
> > > +	 */
> > > +	if (fatal_signal_pending(current))
> > > +		goto out;
> > 
> > theresabug.  It should return "true" here.
> > 
> 
> The intention here is that a process would
> 
> 1. allocate, fail, enter direct reclaim
> 2. no signal pending, gets throttled because of low pfmemalloc reserves
> 3. a user kills -9 the throttled process. returns true and goes back
>    to the page allocator
> 4. If that allocation fails again, it re-enters direct reclaim and tries
>    to throttle. This time the fatal signal is pending but we know
>    we must have already failed to make the allocation so this time false
>    is rurned by throttle_direct_reclaim and it tries direct reclaim.

My spinning head fell on the floor and is now drilling its way to China.

> 5. direct reclaim frees something -- probably clean file-backed pages
>    if the last allocation attempt had failed.
> 
> so the fatal signal check should only prevent entering direct reclaim
> once. Maybe the comment sucks

Well it did say "Returns true if a fatal signal was received during
throttling.".  That "during" was subtle.

> /*
>  * If a fatal signal is pending, this process should not throttle.
>  * It should return quickly so it can exit and free its memory. Note
>  * that returning false here allows a process to enter direct reclaim.
>  * Otherwise there is a risk that the process loops in the page
>  * allocator, checking signals and never making forward progress
>  */
> 
> ?

It's still unclear why throttle_direct_reclaim() returns false if
fatal_signal_pending() *before* throttling, but true *after* throttling. 
Why not always return true and just scram?

>
> ...
>
> Same comment about the potential looping. Otherwise I think it's ok.

Send me something sometime ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
  2012-11-21 15:38                       ` Mel Gorman
@ 2012-11-23  5:09                         ` Minchan Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-11-23  5:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, David Rientjes, Luigi Semenzato, linux-mm, LKML,
	Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Wed, Nov 21, 2012 at 03:38:24PM +0000, Mel Gorman wrote:
> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in am OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
> 
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs. If the process is killed
> while throttled, it will return immediately without direct reclaim
> except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
> reserves.
> 
> Minchan pointed out that it may be better to direct reclaim before returning
> to avoid using the reserves because there may be pages that can easily
> reclaim that would avoid using the reserves. However, we do no such targetted
> reclaim and there is no guarantee that suitable pages are available. As it

I think we could mimic the target reclaim by checking the number of
(NR_FILE_PAGES - NR_SHMEM) and sc.may_swap = false but I am not strong now.
If some problem happens by this, we could consider this.
Now, just want to remain history in case of forgetting.

> is expected that this throttling happens when swap-over-NFS is used there
> is a possibility that the process will instead swap which may allocate
> network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
> case where a process can be throtted and be killed it can use the reserves
> to exit or it can potentially use reserves to swap a few pages and then
> exit. This patch takes the option of using the reserves if necessary to
> allow the process exit quickly.
> 
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks, Mel.

> ---
>  mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48550c6..cbf84e1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this
> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
>  	struct zone *zone;
> @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> -		return;
> +		goto out;
> +
> +	/*
> +	 * If a fatal signal is pending, this process should not throttle.
> +	 * It should return quickly so it can exit and free its memory
> +	 */
> +	if (fatal_signal_pending(current))
> +		goto out;
>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>  	pgdat = zone->zone_pgdat;
>  	if (pfmemalloc_watermark_ok(pgdat))
> -		return;
> +		goto out;
>  
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>  			pfmemalloc_watermark_ok(pgdat), HZ);
> -		return;
> +
> +		goto check_pending;
>  	}
>  
>  	/* Throttle until kswapd wakes the process */
>  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>  		pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +	if (fatal_signal_pending(current))
> +		return true;
> +
> +out:
> +	return false;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>  	/*
> -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -	 * that the page allocator does not consider triggering OOM
> +	 * Do not enter reclaim if fatal signal was delivered while throttled.
> +	 * 1 is returned so that the page allocator does not OOM kill at this
> +	 * point.
>  	 */
> -	if (fatal_signal_pending(current))
> +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>  		return 1;
>  
>  	trace_mm_vmscan_direct_reclaim_begin(order,
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled
@ 2012-11-23  5:09                         ` Minchan Kim
  0 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-11-23  5:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, David Rientjes, Luigi Semenzato, linux-mm, LKML,
	Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Wed, Nov 21, 2012 at 03:38:24PM +0000, Mel Gorman wrote:
> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in am OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
> 
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs. If the process is killed
> while throttled, it will return immediately without direct reclaim
> except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
> reserves.
> 
> Minchan pointed out that it may be better to direct reclaim before returning
> to avoid using the reserves because there may be pages that can easily
> reclaim that would avoid using the reserves. However, we do no such targetted
> reclaim and there is no guarantee that suitable pages are available. As it

I think we could mimic the target reclaim by checking the number of
(NR_FILE_PAGES - NR_SHMEM) and sc.may_swap = false but I am not strong now.
If some problem happens by this, we could consider this.
Now, just want to remain history in case of forgetting.

> is expected that this throttling happens when swap-over-NFS is used there
> is a possibility that the process will instead swap which may allocate
> network buffers from the PFMEMALLOC reserves. Hence, in the swap-over-nfs
> case where a process can be throtted and be killed it can use the reserves
> to exit or it can potentially use reserves to swap a few pages and then
> exit. This patch takes the option of using the reserves if necessary to
> allow the process exit quickly.
> 
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks, Mel.

> ---
>  mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48550c6..cbf84e1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2207,9 +2207,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this
> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
>  	struct zone *zone;
> @@ -2224,13 +2227,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> -		return;
> +		goto out;
> +
> +	/*
> +	 * If a fatal signal is pending, this process should not throttle.
> +	 * It should return quickly so it can exit and free its memory
> +	 */
> +	if (fatal_signal_pending(current))
> +		goto out;
>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>  	pgdat = zone->zone_pgdat;
>  	if (pfmemalloc_watermark_ok(pgdat))
> -		return;
> +		goto out;
>  
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2246,12 +2256,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>  			pfmemalloc_watermark_ok(pgdat), HZ);
> -		return;
> +
> +		goto check_pending;
>  	}
>  
>  	/* Throttle until kswapd wakes the process */
>  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>  		pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +	if (fatal_signal_pending(current))
> +		return true;
> +
> +out:
> +	return false;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2273,13 +2291,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>  	/*
> -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -	 * that the page allocator does not consider triggering OOM
> +	 * Do not enter reclaim if fatal signal was delivered while throttled.
> +	 * 1 is returned so that the page allocator does not OOM kill at this
> +	 * point.
>  	 */
> -	if (fatal_signal_pending(current))
> +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>  		return 1;
>  
>  	trace_mm_vmscan_direct_reclaim_begin(order,
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-11-23  5:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-02  6:39 zram OOM behavior Minchan Kim
2012-11-02  8:30 ` Mel Gorman
2012-11-02 22:36   ` Minchan Kim
2012-11-05 14:46     ` Mel Gorman
2012-11-06  0:25       ` Minchan Kim
2012-11-06  8:58         ` Mel Gorman
2012-11-06 10:17           ` Minchan Kim
2012-11-09  9:50             ` Mel Gorman
2012-11-12 13:32               ` Minchan Kim
2012-11-12 14:06                 ` Mel Gorman
2012-11-13 13:31                   ` Minchan Kim
2012-11-21 15:38                     ` [PATCH] mm: vmscan: Check for fatal signals iff the process was throttled Mel Gorman
2012-11-21 15:38                       ` Mel Gorman
2012-11-21 20:15                       ` Andrew Morton
2012-11-21 20:15                         ` Andrew Morton
2012-11-21 21:05                         ` Mel Gorman
2012-11-21 21:05                           ` Mel Gorman
2012-11-21 21:30                           ` Andrew Morton
2012-11-23  5:09                       ` Minchan Kim
2012-11-23  5:09                         ` Minchan Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.